1 / 22

Matrix “Bit”loaded: A Scalable Lightweight Join Query Processor for RDF Data

Matrix “Bit”loaded: A Scalable Lightweight Join Query Processor for RDF Data. Medha Atre 1 , Vineet Chaoji 2 , Mohammed J. Zaki 1 , and James A. Hendler 1 1 Dept. of Computer Science, Rensselaer Polytechnic Institute, Troy NY, USA 2 Yahoo! Labs, Bangalore, India April 29, 2010

carson
Download Presentation

Matrix “Bit”loaded: A Scalable Lightweight Join Query Processor for RDF Data

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Matrix “Bit”loaded: A Scalable Lightweight Join Query Processor for RDF Data Medha Atre1, Vineet Chaoji2, Mohammed J. Zaki1, and James A. Hendler1 1Dept. of Computer Science, Rensselaer Polytechnic Institute, Troy NY, USA 2Yahoo! Labs, Bangalore, India April 29, 2010 WWW 2010, Raleigh NC, USA

  2. Overview • Introduction • Challenges • Motivation • BitMat structure • Construction & operations • Query processing algorithm • Experimental evaluation • Future roadmap WWW 2010, Raleigh NC, USA

  3. Introduction • RDF (Resource Description Framework) for representing any information • triple form – [<subject> <predicate> <object>] • Depicted as a directed edge • RDF graphs of hundreds of millions of triples to a few billion triples are common nowadays • DBPedia (103 million triples) • UniProt (845 million triples) • US Census (1 billion triples) • Bio2RDF (2.3 billion triples) • Data.gov (5+ billion triples) Subject Predicate Object WWW 2010, Raleigh NC, USA

  4. Challenges – Storing RDF Data • RDF graphs of more than a billion triples (400 GB+ on-disk size). • Traditional DB based efforts • Jena-TDB (custom indexes and storage) • C-store • MonetDB (open-source DB system) • Exploit RDF data characteristics on top of DB storage • Vertical partitioning: create separate predicate table for each predicate. • Compression based techniques • MonetDB and RDF-3X WWW 2010, Raleigh NC, USA

  5. Challenges – Querying RDF Data • Limited main memory compared to disk space • Large intermediate join tables • Scans over large percentage of indexes • Even for aggressive indexing + compression. E.g. Hexastore, RDF-3X • Optimizations • Selectivity estimation in case of multiple level joins, left deep join tree • Sideways (parallel) information passing for several merge-joins • Semi-joins: Semi-joins reduce the database for a given join query WWW 2010, Raleigh NC, USA

  6. Motivation for this work • SPARQL join queries can be broadly classified into 3 types: • Queries having highly selective triple patterns, e.g., (?s :residesIn USA)(?s :hasSSN “123-45-6789”) • Existing techniques handle these queries very efficiently • Queries with low-selectivity triple patterns but highly selective results, e.g., (?s :residesIn China)(?s :citizenOf India) • Queries with low-selectivity triple patterns and low-selectivity results, e.g., (?s :residesIn USA)(?s :hasSSN ?y) • Such queries involving multi-level joins can lead to large intermediate results WWW 2010, Raleigh NC, USA

  7. Our Contribution • A compressed data structure – BitMat to store the RDF data • A join query algorithm which operates directly on the compressed data: • No intermediate join tables, instead, use a 2-phase query algorithm • First phase: prune the candidate RDF triples • Second phase: stitch the final results directly from the pruned triples • Can guaranty memory requirements at the beginning of the query • Online/streaming result generation WWW 2010, Raleigh NC, USA

  8. BitMat Construction • Conceptually construct a bit-cube of subject (S), predicate (P), object (O) dimensions • Mapping dictionary: • Vs: Set of subjects, Vp: Set of predicates, Vo: Set of objects, Vso= Vs Vo • Common subject and object URIs mapped to same integer IDs 1 to |Vso| • Subject only URIs mapped to integer IDs |Vso|+1 to |Vs| S-dimension Vs P-dimension Vso 1 1 Vso O-dimension WWW 2010, Raleigh NC, USA Vo

  9. BitMat Construction (continued..) S-dimension P3 • Slice along P dimension and store S-O and O-S BitMats • Apply gap-encoding to each row of the BitMat before storing it • Storage: 2 |Vp| + |Vs| + |Vo| BitMats • Additionally store condensed representation of rows and columns and number of triples in each of the 4 types of BitMats 0 0 0 P2 1 0 0 0 1 0 0 P1 1 0 0 1 0 SO2 0 1 0 0 SO1 0 1 0 SO1 0 P-dimension SO2 0 O3 O-dimension O4 WWW 2010, Raleigh NC, USA

  10. Operations on BitMat • Join algorithm uses two basic operations: fold & unfold • fold(BitMat, dimension) returns bitArray • Folds the input BitMat by retaining the dimension • unfold(BitMat, MaskBitArray, dimension) • Unfolds MaskBitArray on the BitMat in dimension • Fold & unfold operate by doing bitwise AND/OR operations on gap compressed bit-vectors 1 1 1 1 1 1 1 1 1 1 1 1 1 1 WWW 2010, Raleigh NC, USA

  11. Query Processing Algorithm • Build a constraint graph • E.g., query (?m rdf:type :movie)(?n rdf:type movie)(?m :similar_to :n) has constraint graph as • Each triple pattern has a BitMat containing only triples matching that triple pattern • Propagate the constraints on join variable bindings imposed by each triple pattern Gjvar ?m ?n ?m rdf:type :movie ?m :similar_to ?n ?n rdf:type :movie Gtp SS SO WWW 2010, Raleigh NC, USA

  12. Phase 1 -- Pruning phase • Embed a tree on the subgraph Gjvar • Walk over this tree from root to leaves and back in BFS order • At each node in the tree over Gjvar, collect all the variable bindings from the BitMats of the triple patterns containing that variable (fold operation) • Do a bitwise AND of all folded bit-arrays obtained • Relay back the results of bitwise AND on the BitMats (unfold operation) • Simple optimizations: • Tree root selection: Select the join variable having the least number of triples in their BitMats as the root of the tree over Gjvar • Early stopping: If at any point, the result of bitwise AND of folded bit-arrays is null WWW 2010, Raleigh NC, USA

  13. Pruning phase 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ?m ?n ?m fold unfold fold unfold fold unfold fold unfold 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ?m rdf:type :movie ?m :similar_to ?n ?n rdf:type :movie In the reverse traversal while propagating effect of join over “?n”, the fold of 2nd BitMat yields same bit-array as the mask bit-array of ?m before, hence there is no need to do fold/unfold again on the first BitMat WWW 2010, Raleigh NC, USA

  14. Phase 2 -- Final result generation • Resembles a multi-way join • Start with the triple pattern with least number of triples left in its BitMat • Generate bindings for variables in that triple pattern • Next, select another triple pattern which shares a join variable with any of the previously selected triple patterns • Check if it can generate the same bindings for the shared join variable and generate bindings for its other variables • Continue this and at the end of one round when all triple patterns are processed and all variables have consistent bindings, output the result WWW 2010, Raleigh NC, USA

  15. Final result generation Sample query ?a rdf:type :Person ?a :worksFor ?b ?c :departmentOf ?b ?a rdf:type :Person 1 1 1 ?a 1 1 1 1 ?b ?a :worksFor ?b :s1 1 1 1 1 ?a 1 1 :o3 :o2 Output this result 1 :t3 :t3 :t4 ?b ?c :departmentOf ?b 1 1 ?c 1 1 1 1 1 1 1 WWW 2010, Raleigh NC, USA

  16. Evaluation setup • Competitive RDF stores: • MonetDB • RDF-3X • Datasets: • UniProt: Protein dataset with ~845M triples, ~147M subjects, 95 predicates, and ~128M objects • LUBM: Synthetic university dataset with ~1.33B triples, ~217M subjects, 18 predicates, and ~161M objects • Queries: • UniProt: Queries published by UniProt dataset owners and RDF-3X • LUBM: Queries published by OpenRDF • Development environment: • Dell Optiplex 755 PC, 3.0 GHz Intel E6850 Core 2 Duo Processor, 4 GB memory. • 7 GB swap space on 7200 rpm 1 TB disk. • 64 bit 2.6.28-15 Linux kernel (Ubuntu 9.04 distribution). WWW 2010, Raleigh NC, USA

  17. Results • For queries with low-selectivity triple patterns, BitMat outperformed MonetDB and RDF-3X by 2-3 orders of magnitude • For highly selective triple patterns, RDF-3X gave superior performance, especially for queries where sideways-information-passing (SIP) could benefit • BitMat’s shortcomings in case of highly selective queries: • The 2-phase query processing can create additional overheads for highly selective queries • No cache memory optimization • No memory mapping of disk files WWW 2010, Raleigh NC, USA

  18. UniProt 845 million triples (time in sec) More results in the paper WWW 2010, Raleigh NC, USA

  19. LUBM 1.33 billion triples (time in sec) WWW 2010, Raleigh NC, USA

  20. Comparison of index storage space WWW 2010, Raleigh NC, USA

  21. Future Roadmap • Does not allow a subset of variables to be specified by the SELECT clause • Does not have ability to process other class of SPARQL queries, e.g., OPTIONAL, UNION, FILTER etc. • S-P or P-O dimensional joins not handled • Rare in assertional RDF data • Cannot perform addition/deletion/update of triples • Incorporate lazy-loading of BitMats to avoid overheads for highly selective queries WWW 2010, Raleigh NC, USA

  22. Thank you! WWW 2010, Raleigh NC, USA

More Related