RDF-3X: RISC-Style RDF Database Engine
Thomas Neumann, Gerhard Weikum, PVLDB 2008
09 Jan 2014, SNU IDB Lab., Woo Hyun Lee
Introduction
[Concept map: RDF is schema-less and not normalized, and is expensive; RDF is for the Semantic Web; the Semantic Web needs an effective system; RDF-3X is effective and efficient.]
RDF-3X*: Existing RDF Systems
• Triples are stored in a relational database (RDB)
• Type 1: Triples Table
  • All triples in a single table with 3 columns (Subject, Predicate, Object)
• Type 2: Property Table
  • Grouped by predicates
• Type 3: Cluster-Property Table
  • Clustered by correlated predicates, entity class, occurrence statistics
[Figure: example RDF triples (Capital)]
* Thomas Neumann, Gerhard Weikum, RDF-3X: a RISC-style Engine for RDF, PVLDB '08
RDF-3X: RDF Storage
• Huge Triples Table
  • All triples stored in a clustered B+ tree in lexicographical order
  • Fast range scan
• Mapping Dictionary
  • All literals are mapped to IDs
  • Compressed
  • Simple query processing
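The storage idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the RDF-3X implementation: the class `MappingDictionary` and the function `build_triples_table` are hypothetical names, and a sorted Python list stands in for the clustered B+ tree.

```python
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]
IdTriple = Tuple[int, int, int]


class MappingDictionary:
    """Hypothetical string <-> integer ID dictionary (illustrative only)."""

    def __init__(self) -> None:
        self._to_id: Dict[str, int] = {}
        self._to_str: List[str] = []

    def encode(self, value: str) -> int:
        # Assign the next free ID the first time a value is seen.
        if value not in self._to_id:
            self._to_id[value] = len(self._to_str)
            self._to_str.append(value)
        return self._to_id[value]

    def decode(self, ident: int) -> str:
        return self._to_str[ident]


def build_triples_table(triples: Iterable[Triple],
                        dictionary: MappingDictionary) -> List[IdTriple]:
    # Replace every string by its ID, drop duplicates, and sort
    # lexicographically. The sorted list stands in for the B+ tree leaves.
    encoded = {tuple(dictionary.encode(v) for v in t) for t in triples}
    return sorted(encoded)


dictionary = MappingDictionary()
table = build_triples_table(
    [("Malaysia", "Capital", "Kuala Lumpur"), ("Japan", "Capital", "Tokyo")],
    dictionary,
)
```

Because the table holds only integers, comparisons during a scan are integer comparisons, and strings are looked up in the dictionary only when results are returned.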
RDF-3X: RISC-Style RDF Storage
• RISC (Reduced Instruction Set Computing)
  • Proposed by John Cocke (IBM) in 1974
  • 20% of all instructions do 80% of the work
  • Use simple instructions
  • Simplification leads to more intuitive processing and less overhead
• RISC-Style RDF Storage
  • Reduced complexity
  • Mapping Dictionary
    • Convert literals to integer-based IDs
    • Compare IDs
    • Produce streams of ID tuples
  • Compressed triples
RDF-3X: Compressed Index
• Six separate indexes
  • (SPO, SOP, OSP, OPS, PSO, POS)
• Stored in the leaf pages of the clustered B+ tree
[Figure: Triple Index. The triple patterns ?var - <P> - <O>, <S> - ?var - <O>, <S> - <P> - ?var, and ?var - <P> - ?var are each matched against the six indexes SPO, SOP, PSO, POS, OSP, OPS.]
RDF-3X: Compressed Index
SPO: <Malaysia> <Capital> <Kuala Lumpur>
SOP: <Malaysia> <Kuala Lumpur> <Capital>
PSO: <Capital> <Malaysia> <Kuala Lumpur>
POS: <Capital> <Kuala Lumpur> <Malaysia>
OSP: <Kuala Lumpur> <Malaysia> <Capital>
OPS: <Kuala Lumpur> <Capital> <Malaysia>
SPARQL triple patterns:
?X - <Capital> - <Kuala Lumpur>
<Malaysia> - ?X - <Kuala Lumpur>
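How a triple pattern can pick one of the six orderings may be easier to see in code. The sketch below is an assumption-laden illustration: `build_indexes`, `choose_order`, and `scan` are hypothetical helpers, sorted lists replace the B+ trees, and strings are used instead of dictionary IDs for readability. The rule shown (bound positions first, variables last, then a prefix range scan) is one straightforward way to use the permutations, not necessarily the engine's exact planner logic.

```python
from bisect import bisect_left
from itertools import permutations

# The six orderings: SPO, SOP, PSO, POS, OSP, OPS.
ORDERS = ["".join(p) for p in permutations("SPO")]


def build_indexes(triples):
    """Store every (s, p, o) triple once per ordering, sorted."""
    indexes = {}
    for order in ORDERS:
        positions = ["SPO".index(c) for c in order]
        indexes[order] = sorted(tuple(t[i] for i in positions) for t in triples)
    return indexes


def choose_order(pattern):
    """pattern is (s, p, o) with None marking a variable.
    Bound columns go first so that matches form one contiguous range."""
    bound = [c for c, v in zip("SPO", pattern) if v is not None]
    free = [c for c, v in zip("SPO", pattern) if v is None]
    return "".join(bound + free)


def scan(indexes, pattern):
    """Range-scan the chosen index and return matches in (s, p, o) order."""
    order = choose_order(pattern)
    index = indexes[order]
    prefix = tuple(v for v in pattern if v is not None)
    results = []
    for row in index[bisect_left(index, prefix):]:
        if row[:len(prefix)] != prefix:
            break  # left the contiguous range of matches
        by_column = dict(zip(order, row))
        results.append((by_column["S"], by_column["P"], by_column["O"]))
    return results


indexes = build_indexes([
    ("Malaysia", "Capital", "Kuala Lumpur"),
    ("Japan", "Capital", "Tokyo"),
])
matches = scan(indexes, (None, "Capital", "Kuala Lumpur"))  # uses POS
```

With this rule, the pattern `?X - <Capital> - <Kuala Lumpur>` is served by POS and `<Malaysia> - ?X - <Kuala Lumpur>` by SOP.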
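RDF-3X: Compressed Index
• Triples are stored in collation order
• Neighboring triples are very similar
• Stores only the change between consecutive triples
• Compression: byte-level encoding of the deltas (in the paper, LZ77 is evaluated only as an additional layer on top of this scheme)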
RDF-3X: Compressed Index
• Compression
  • Stores only the change (δ) between triples
[Figure: B+ Tree Index vs. Compressed B+ Tree Index]
<Malaysia, Capital, K-L> → <651, 954, 260>
<Malaysia, Capital, KL> → <651, 954, 270>
<Malaysia, Capital, Kuala Lumpur> → <651, 954, 275>
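The delta idea can be tried on the three ID triples from this slide. The sketch below is a simplification: it takes plain component-wise differences, whereas the actual engine packs the deltas into a compact byte-level format that is not reproduced here. The function names `delta_encode` and `delta_decode` are hypothetical.

```python
def delta_encode(sorted_triples):
    """Replace each triple by its component-wise difference to the previous one.
    The first triple is stored relative to (0, 0, 0), i.e. in full."""
    encoded = []
    previous = (0, 0, 0)
    for triple in sorted_triples:
        encoded.append(tuple(c - p for c, p in zip(triple, previous)))
        previous = triple
    return encoded


def delta_decode(deltas):
    """Invert delta_encode by accumulating the differences."""
    decoded = []
    previous = (0, 0, 0)
    for delta in deltas:
        previous = tuple(p + d for p, d in zip(previous, delta))
        decoded.append(previous)
    return decoded


triples = [(651, 954, 260), (651, 954, 270), (651, 954, 275)]
deltas = delta_encode(triples)
# deltas == [(651, 954, 260), (0, 0, 10), (0, 0, 5)]
```

Because the triples are sorted, most deltas are small numbers or zeros, which is what makes a compact encoding of them worthwhile.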
RDF-3X: Compressed Indexes
• Leaf-level compression
  • Directly read triples (less decompression cost)
  • Easy update
  • Better concurrency control and recovery
[Figure: a leaf-level compressed B+ tree index contrasted with a chunk-level compressed B+ tree index, labeled <Previous Approach> and <RISC Style Approach>]
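Why compressing each leaf page on its own keeps reads and updates cheap can be illustrated as follows. This is a sketch under assumptions: a "page" is modeled as a fixed number of triples (`page_size` is a made-up parameter), and `compress_pages` and `read_page` are hypothetical helpers rather than functions of the real system.

```python
def compress_pages(sorted_triples, page_size):
    """Split the sorted triples into pages and delta-encode inside each page only.
    Every page keeps its first triple in full, so it never depends on a neighbor."""
    pages = []
    for start in range(0, len(sorted_triples), page_size):
        page = sorted_triples[start:start + page_size]
        first = page[0]
        deltas = []
        previous = first
        for triple in page[1:]:
            deltas.append(tuple(c - p for c, p in zip(triple, previous)))
            previous = triple
        pages.append((first, deltas))
    return pages


def read_page(page):
    """Decode one page without touching any other page."""
    first, deltas = page
    triples = [first]
    for delta in deltas:
        triples.append(tuple(p + d for p, d in zip(triples[-1], delta)))
    return triples


triples = [(1, 2, 3), (1, 2, 7), (1, 5, 1), (4, 1, 1)]
pages = compress_pages(triples, page_size=2)
second_page = read_page(pages[1])  # decoded independently of pages[0]
```

Since no delta crosses a page boundary, changing or re-reading one page leaves every other page untouched, which matches the update and recovery points on the slide.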
RDF-3X: Aggregated Indexes [1/2]
• For many SPARQL patterns, indexing partial triples rather than full triples is sufficient
  • SELECT ?a ?b WHERE { ?a ?b ?c }
• Two-value indexes
  • Two of the three columns of a triple
  • (value1, value2, count)
  • (SP, PS, SO, OS, PO, OP)
• One-value indexes
  • One of the three columns of a triple
  • (value, count)
  • (S, P, O)
RDF-3X: Aggregated Indexes [2/2]
SELECT ?a ?c WHERE { ?a ?b ?c . }
[Figure: Aggregated Indexes]
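The (value1, value2, count) and (value, count) entries can be derived from the full triples as sketched below. The helper names `aggregated_two_value` and `aggregated_one_value` are hypothetical, and sorted lists again stand in for the B+ trees.

```python
from collections import Counter


def aggregated_two_value(triples, columns):
    """Build a two-value index such as "SO": (value1, value2, count) entries.
    count is the number of full triples sharing that pair."""
    i, j = ("SPO".index(c) for c in columns)
    counts = Counter((t[i], t[j]) for t in triples)
    return sorted((a, b, n) for (a, b), n in counts.items())


def aggregated_one_value(triples, column):
    """Build a one-value index such as "S": (value, count) entries."""
    i = "SPO".index(column)
    counts = Counter(t[i] for t in triples)
    return sorted(counts.items())


triples = [(1, 2, 3), (1, 4, 3), (1, 2, 5)]
so_index = aggregated_two_value(triples, "SO")
# so_index == [(1, 3, 2), (1, 5, 1)]
```

A query like `SELECT ?a ?c WHERE { ?a ?b ?c . }` never needs the predicate, so it can be answered from the SO entries alone; the count records how many times each (subject, object) pair has to be emitted to preserve duplicates.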
Conclusion
• Redundancy
  • From a simple single index to multiple duplicated indexes
• Complex compression algorithm
  • High computation cost
• Need for Index
  • Cheaper & faster hardware
• Reliability
  • Not verified