Matrix “Bit”loaded: A Scalable Lightweight Join Query Processor for RDF Data

Matrix “Bit”loaded: A Scalable Lightweight Join Query Processor for RDF Data Medha Atre1, Vineet Chaoji2, Mohammed J. Zaki1, and James A. Hendler1 1Dept. of Computer Science, Rensselaer Polytechnic Institute, Troy NY, USA 2Yahoo! Labs, Bangalore, India April 29, 2010 WWW 2010, Raleigh NC, USA

Overview • Introduction • Challenges • Motivation • BitMat structure • Construction & operations • Query processing algorithm • Experimental evaluation • Future roadmap WWW 2010, Raleigh NC, USA

Introduction • RDF (Resource Description Framework) for representing any information • triple form – [<subject> <predicate> <object>] • Depicted as a directed edge • RDF graphs of hundreds of millions of triples to a few billion triples are common nowadays • DBPedia (103 million triples) • UniProt (845 million triples) • US Census (1 billion triples) • Bio2RDF (2.3 billion triples) • Data.gov (5+ billion triples) Subject Predicate Object WWW 2010, Raleigh NC, USA

Challenges – Storing RDF Data • RDF graphs of more than a billion triples (400 GB+ on-disk size). • Traditional DB based efforts • Jena-TDB (custom indexes and storage) • C-store • MonetDB (open-source DB system) • Exploit RDF data characteristics on top of DB storage • Vertical partitioning: create separate predicate table for each predicate. • Compression based techniques • MonetDB and RDF-3X WWW 2010, Raleigh NC, USA

Challenges – Querying RDF Data • Limited main memory compared to disk space • Large intermediate join tables • Scans over large percentage of indexes • Even for aggressive indexing + compression. E.g. Hexastore, RDF-3X • Optimizations • Selectivity estimation in case of multiple level joins, left deep join tree • Sideways (parallel) information passing for several merge-joins • Semi-joins: Semi-joins reduce the database for a given join query WWW 2010, Raleigh NC, USA

Motivation for this work • SPARQL join queries can be broadly classified into 3 types: • Queries having highly selective triple patterns, e.g., (?s :residesIn USA)(?s :hasSSN “123-45-6789”) • Existing techniques handle these queries very efficiently • Queries with low-selectivity triple patterns but highly selective results, e.g., (?s :residesIn China)(?s :citizenOf India) • Queries with low-selectivity triple patterns and low-selectivity results, e.g., (?s :residesIn USA)(?s :hasSSN ?y) • Such queries involving multi-level joins can lead to large intermediate results WWW 2010, Raleigh NC, USA

Our Contribution • A compressed data structure – BitMat to store the RDF data • A join query algorithm which operates directly on the compressed data: • No intermediate join tables, instead, use a 2-phase query algorithm • First phase: prune the candidate RDF triples • Second phase: stitch the final results directly from the pruned triples • Can guaranty memory requirements at the beginning of the query • Online/streaming result generation WWW 2010, Raleigh NC, USA

BitMat Construction • Conceptually construct a bit-cube of subject (S), predicate (P), object (O) dimensions • Mapping dictionary: • Vs: Set of subjects, Vp: Set of predicates, Vo: Set of objects, Vso= Vs Vo • Common subject and object URIs mapped to same integer IDs 1 to |Vso| • Subject only URIs mapped to integer IDs |Vso|+1 to |Vs| S-dimension Vs P-dimension Vso 1 1 Vso O-dimension WWW 2010, Raleigh NC, USA Vo

BitMat Construction (continued..) S-dimension P3 • Slice along P dimension and store S-O and O-S BitMats • Apply gap-encoding to each row of the BitMat before storing it • Storage: 2 |Vp| + |Vs| + |Vo| BitMats • Additionally store condensed representation of rows and columns and number of triples in each of the 4 types of BitMats 0 0 0 P2 1 0 0 0 1 0 0 P1 1 0 0 1 0 SO2 0 1 0 0 SO1 0 1 0 SO1 0 P-dimension SO2 0 O3 O-dimension O4 WWW 2010, Raleigh NC, USA

Operations on BitMat • Join algorithm uses two basic operations: fold & unfold • fold(BitMat, dimension) returns bitArray • Folds the input BitMat by retaining the dimension • unfold(BitMat, MaskBitArray, dimension) • Unfolds MaskBitArray on the BitMat in dimension • Fold & unfold operate by doing bitwise AND/OR operations on gap compressed bit-vectors 1 1 1 1 1 1 1 1 1 1 1 1 1 1 WWW 2010, Raleigh NC, USA

Query Processing Algorithm • Build a constraint graph • E.g., query (?m rdf:type :movie)(?n rdf:type movie)(?m :similar_to :n) has constraint graph as • Each triple pattern has a BitMat containing only triples matching that triple pattern • Propagate the constraints on join variable bindings imposed by each triple pattern Gjvar ?m ?n ?m rdf:type :movie ?m :similar_to ?n ?n rdf:type :movie Gtp SS SO WWW 2010, Raleigh NC, USA

Phase 1 -- Pruning phase • Embed a tree on the subgraph Gjvar • Walk over this tree from root to leaves and back in BFS order • At each node in the tree over Gjvar, collect all the variable bindings from the BitMats of the triple patterns containing that variable (fold operation) • Do a bitwise AND of all folded bit-arrays obtained • Relay back the results of bitwise AND on the BitMats (unfold operation) • Simple optimizations: • Tree root selection: Select the join variable having the least number of triples in their BitMats as the root of the tree over Gjvar • Early stopping: If at any point, the result of bitwise AND of folded bit-arrays is null WWW 2010, Raleigh NC, USA

Pruning phase 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ?m ?n ?m fold unfold fold unfold fold unfold fold unfold 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ?m rdf:type :movie ?m :similar_to ?n ?n rdf:type :movie In the reverse traversal while propagating effect of join over “?n”, the fold of 2nd BitMat yields same bit-array as the mask bit-array of ?m before, hence there is no need to do fold/unfold again on the first BitMat WWW 2010, Raleigh NC, USA

Phase 2 -- Final result generation • Resembles a multi-way join • Start with the triple pattern with least number of triples left in its BitMat • Generate bindings for variables in that triple pattern • Next, select another triple pattern which shares a join variable with any of the previously selected triple patterns • Check if it can generate the same bindings for the shared join variable and generate bindings for its other variables • Continue this and at the end of one round when all triple patterns are processed and all variables have consistent bindings, output the result WWW 2010, Raleigh NC, USA

Final result generation Sample query ?a rdf:type :Person ?a :worksFor ?b ?c :departmentOf ?b ?a rdf:type :Person 1 1 1 ?a 1 1 1 1 ?b ?a :worksFor ?b :s1 1 1 1 1 ?a 1 1 :o3 :o2 Output this result 1 :t3 :t3 :t4 ?b ?c :departmentOf ?b 1 1 ?c 1 1 1 1 1 1 1 WWW 2010, Raleigh NC, USA

Evaluation setup • Competitive RDF stores: • MonetDB • RDF-3X • Datasets: • UniProt: Protein dataset with ~845M triples, ~147M subjects, 95 predicates, and ~128M objects • LUBM: Synthetic university dataset with ~1.33B triples, ~217M subjects, 18 predicates, and ~161M objects • Queries: • UniProt: Queries published by UniProt dataset owners and RDF-3X • LUBM: Queries published by OpenRDF • Development environment: • Dell Optiplex 755 PC, 3.0 GHz Intel E6850 Core 2 Duo Processor, 4 GB memory. • 7 GB swap space on 7200 rpm 1 TB disk. • 64 bit 2.6.28-15 Linux kernel (Ubuntu 9.04 distribution). WWW 2010, Raleigh NC, USA

Results • For queries with low-selectivity triple patterns, BitMat outperformed MonetDB and RDF-3X by 2-3 orders of magnitude • For highly selective triple patterns, RDF-3X gave superior performance, especially for queries where sideways-information-passing (SIP) could benefit • BitMat’s shortcomings in case of highly selective queries: • The 2-phase query processing can create additional overheads for highly selective queries • No cache memory optimization • No memory mapping of disk files WWW 2010, Raleigh NC, USA

UniProt 845 million triples (time in sec) More results in the paper WWW 2010, Raleigh NC, USA

LUBM 1.33 billion triples (time in sec) WWW 2010, Raleigh NC, USA

Comparison of index storage space WWW 2010, Raleigh NC, USA

Future Roadmap • Does not allow a subset of variables to be specified by the SELECT clause • Does not have ability to process other class of SPARQL queries, e.g., OPTIONAL, UNION, FILTER etc. • S-P or P-O dimensional joins not handled • Rare in assertional RDF data • Cannot perform addition/deletion/update of triples • Incorporate lazy-loading of BitMats to avoid overheads for highly selective queries WWW 2010, Raleigh NC, USA

Thank you! WWW 2010, Raleigh NC, USA

Matrix “Bit”loaded: A Scalable Lightweight Join Query Processor for RDF Data

Matrix “Bit”loaded: A Scalable Lightweight Join Query Processor for RDF Data

Presentation Transcript

Dense Matrix Algorithms

Dense Matrix Algorithms

Scalable Data Mining

CRUD Matrix

Displaying Selected Data with Queries

QUERY OPTIMIZATION AND QUERY PROCESSING

Query Processing: Query Formulation

Distributed Databases

SPARQL

CPS216: Advanced Database Systems Notes 06:Query Execution (Sort and Join operators)

6 . Distributed Query Optimization

Outline

Structured Query Language (SQL)

Lecture 7: Query Execution

ARM Processor Architecture

Data Mining 2

Scalable Web Architectures

Welcome to Matrix Controls and MKMS Matrix Knit Monitoring System

Chapter 20

INTRODUCTION TO PEOPLESOFT QUERY

What’s in This Module?