RDF languages and storages, part 2 - indexing semi-structured data
Maciej Janik, Conrad Ibanez
CSCI 8350, Fall 2004
Outline
• Jena storage
• Indexing techniques
Jena
• Implemented in Java
• One of the most widely used RDF stores and query engines
• Supports RDF, RDFS and OWL
• In-memory and persistent storage (Oracle, MySQL, PostgreSQL)
• RDQL query language
• Reasoning/inference engine
Jena - storage schema
• The previous version (Jena1) used normalized relational DB tables
  • statements
  • literals
  • resources
• The current approach stores triples as (Subject, Predicate, Object) in denormalized tables
• Optimization for common statement patterns - grouping of properties
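The difference between the two layouts can be sketched with SQLite from Python. This is only an illustration under assumptions: the table and column names (`resources`, `literals`, `statements`, `triples`) and the sample data are hypothetical and are not the actual Jena schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
-- Normalized layout: statements point to resources and literals by id.
CREATE TABLE resources (id INTEGER PRIMARY KEY, uri TEXT UNIQUE);
CREATE TABLE literals  (id INTEGER PRIMARY KEY, value TEXT);
CREATE TABLE statements (
    subject    INTEGER REFERENCES resources(id),
    predicate  INTEGER REFERENCES resources(id),
    object_lit INTEGER REFERENCES literals(id)
);
-- Denormalized layout: every row carries the full text of the triple.
CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT);
""")

cur.execute("INSERT INTO resources VALUES (1, 'ex:doc1')")
cur.execute("INSERT INTO resources VALUES (2, 'dc:title')")
cur.execute("INSERT INTO literals VALUES (1, 'Jena2')")
cur.execute("INSERT INTO statements VALUES (1, 2, 1)")
cur.execute("INSERT INTO triples VALUES ('ex:doc1', 'dc:title', 'Jena2')")

# Normalized lookup: three joins to turn ids back into text.
normalized = cur.execute("""
    SELECT s.uri, p.uri, l.value
    FROM statements st
    JOIN resources s ON s.id = st.subject
    JOIN resources p ON p.id = st.predicate
    JOIN literals  l ON l.id = st.object_lit
""").fetchall()

# Denormalized lookup: a single table scan, no joins.
denormalized = cur.execute(
    "SELECT subject, predicate, object FROM triples"
).fetchall()
```

Both queries return the same triple; the denormalized table repeats the URI text in every row but avoids the joins.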
Jena - storage
[Figure: normalized tables vs. denormalized tables. Source: “Efficient RDF Storage and Retrieval in Jena2” - Wilkinson et al.]
Jena - storage
• Accepts a trade-off: more storage space in exchange for shorter search time
• Clusters properties that are likely to be accessed together - optimizes for common patterns
• Special treatment of reified statements
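Clustering properties can be pictured as a table with one row per subject and one column per grouped property. The sketch below is a hedged illustration, not Jena's implementation: the `doc_properties` table, the choice of `dc:title` and `dc:creator` as the clustered pair, and the sample values are all assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT);
-- Hypothetical property table: one row per subject,
-- one column per property that is usually read together.
CREATE TABLE doc_properties (subject TEXT PRIMARY KEY, title TEXT, creator TEXT);
""")
conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", [
    ("ex:doc1", "dc:title", "Jena2"),
    ("ex:doc1", "dc:creator", "Wilkinson"),
])
conn.execute("INSERT INTO doc_properties VALUES ('ex:doc1', 'Jena2', 'Wilkinson')")

# Generic triple table: one self-join per additional property.
from_triples = conn.execute("""
    SELECT t.object, c.object
    FROM triples t JOIN triples c ON c.subject = t.subject
    WHERE t.subject = 'ex:doc1'
      AND t.predicate = 'dc:title'
      AND c.predicate = 'dc:creator'
""").fetchall()

# Property table: both values come from a single row.
from_cluster = conn.execute(
    "SELECT title, creator FROM doc_properties WHERE subject = 'ex:doc1'"
).fetchall()
```

The clustered table answers the common "title and creator of a document" pattern without a join, at the cost of a schema tailored to that pattern.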
Jena - graph abstraction
• Graph interface is separated from the (persistent) triple storage layer
• Special support for different types of graphs - optimized for performance
• Supports operations such as add, delete and find
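A minimal in-memory sketch of such a graph interface is shown below. The class name `MemoryGraph` and the use of `None` as a wildcard in `find` are assumptions made for illustration; this is not Jena's Java API, only the same three operations expressed in Python.

```python
class MemoryGraph:
    """Toy triple store exposing add, delete and find (hypothetical names)."""

    def __init__(self):
        self._triples = set()

    def add(self, subject, predicate, obj):
        self._triples.add((subject, predicate, obj))

    def delete(self, subject, predicate, obj):
        # discard() does nothing if the triple is absent
        self._triples.discard((subject, predicate, obj))

    def find(self, subject=None, predicate=None, obj=None):
        """Return matching triples; None in a position matches anything."""
        pattern = (subject, predicate, obj)
        return sorted(
            triple for triple in self._triples
            if all(p is None or p == v for p, v in zip(pattern, triple))
        )


graph = MemoryGraph()
graph.add("ex:doc1", "dc:title", "Jena2")
graph.add("ex:doc1", "dc:creator", "Wilkinson")
graph.add("ex:doc2", "dc:title", "Index Fabric")

titles = graph.find(predicate="dc:title")
graph.delete("ex:doc2", "dc:title", "Index Fabric")
remaining_titles = graph.find(predicate="dc:title")
```

Because callers only see `add`, `delete` and `find`, the set could be swapped for a database-backed store without changing client code, which is the point of separating the interface from the storage layer.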
Jena - query processing
• Converts multiple patterns in a query into a single query to the DB
• Uses the DB query optimizer instead of executing multiple queries at the Jena level (as in Jena1)
• Associates a pattern with a single table (best case) or spans the pattern across tables (requires a join operation)
• A query may span different graphs, but it can be optimized only if they are in the same database
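The idea of turning several triple patterns into one database query can be sketched as follows. This is a simplified illustration, not Jena's query compiler: the function `patterns_to_sql`, the single `triples` table, and the convention that terms starting with `?` are variables are all assumptions, and the sketch expects at least one variable in the patterns.

```python
import sqlite3


def patterns_to_sql(patterns):
    """Compile triple patterns into one SELECT with self-joins.

    Terms starting with '?' are variables; anything else is a constant.
    """
    columns = ("subject", "predicate", "object")
    first_seen = {}  # variable -> "alias.column" where it first appeared
    conditions, params = [], []
    for i, pattern in enumerate(patterns):
        for column, term in zip(columns, pattern):
            ref = f"t{i}.{column}"
            if term.startswith("?"):
                if term in first_seen:
                    # Shared variable: becomes a join condition.
                    conditions.append(f"{ref} = {first_seen[term]}")
                else:
                    first_seen[term] = ref
            else:
                # Constant: becomes a bound parameter.
                conditions.append(f"{ref} = ?")
                params.append(term)
    select = ", ".join(f"{ref} AS {var[1:]}" for var, ref in first_seen.items())
    tables = ", ".join(f"triples t{i}" for i in range(len(patterns)))
    where = " AND ".join(conditions) or "1 = 1"
    return f"SELECT {select} FROM {tables} WHERE {where}", params


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", [
    ("ex:doc1", "dc:creator", "ex:alice"),
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:doc2", "dc:creator", "ex:bob"),
    ("ex:bob", "foaf:name", "Bob"),
])

# Two patterns sharing the variable ?who are answered by one SQL statement.
sql, params = patterns_to_sql([
    ("?doc", "dc:creator", "?who"),
    ("?who", "foaf:name", "Alice"),
])
rows = conn.execute(sql, params).fetchall()
```

Handing the whole join to the database lets its optimizer pick the join order, instead of the application running one query per pattern and combining the results itself.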
Indexing semistructured data
• XML cannot be indexed directly the way a relational DB can
• Indexing may take advantage of the tree structure
  • depth of node
  • common path from the root
  • convert each path to a string expression
  • precalculate the path tree
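Converting each root-to-leaf path to a string, together with the node depth, can be sketched with Python's standard `xml.etree.ElementTree`. The sample document and the `/` separator are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET


def leaf_paths(element, prefix=()):
    """Yield (path string, depth, text) for every leaf element."""
    path = prefix + (element.tag,)
    children = list(element)
    if not children:
        # depth counts the tags on the path, so the root has depth 1
        yield "/".join(path), len(path), (element.text or "").strip()
    for child in children:
        yield from leaf_paths(child, path)


document = (
    "<invoice>"
    "<buyer><name>ABC Corp</name></buyer>"
    "<seller><name>Acme</name></seller>"
    "</invoice>"
)
paths = list(leaf_paths(ET.fromstring(document)))
```

Here both leaves share the common prefix `invoice`, which is the kind of shared structure a path index can exploit.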
Indexing semistructured data
• Idea is based on the Patricia trie
• Index should scale with the growth of data
• Path together with leaf is encoded into a string -> the Index Fabric
“A Fast Index for Semistructured Data” - Brian F. Cooper et al.
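To show why a trie suits string-encoded paths, here is a sketch of a plain character trie. It is deliberately not a Patricia trie (which additionally compresses chains of single-child nodes) and not the Index Fabric itself; the class name, the sample keys and the record ids are hypothetical.

```python
END = object()  # sentinel key marking "a stored key ends at this node"


class Trie:
    """Plain (uncompressed) trie over strings; a simplification of Patricia."""

    def __init__(self):
        self.root = {}

    def insert(self, key, value):
        node = self.root
        for ch in key:
            node = node.setdefault(ch, {})
        node[END] = value

    def search(self, key):
        """Return the value stored under key, or None if absent."""
        node = self.root
        for ch in key:
            node = node.get(ch)
            if node is None:
                return None
        return node.get(END)

    def keys_with_prefix(self, prefix):
        """Return all stored keys that start with prefix, sorted."""
        node = self.root
        for ch in prefix:
            node = node.get(ch)
            if node is None:
                return []
        found, stack = [], [(prefix, node)]
        while stack:
            key, current = stack.pop()
            for ch, child in current.items():
                if ch is END:
                    found.append(key)
                else:
                    stack.append((key + ch, child))
        return sorted(found)


index = Trie()
index.insert("A alpha", 1)       # values are hypothetical record ids
index.insert("A B beta", 2)
index.insert("A B C gamma", 3)

hit = index.search("A B beta")
under_ab = index.keys_with_prefix("A B")
```

Keys that share a path prefix share trie nodes, so lookup cost depends on the key length rather than on the number of stored keys.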
A Layered Index
[Figure: layered index. Source: “A Fast Index for Semistructured Data” - Brian F. Cooper et al.]
Index Fabric
• Index is used to accelerate path expressions - mainly for queries that ask for a root-to-leaf path
• Idea of prefix encoding
  • xml: <A>alpha<B>beta<C>gamma</C></B></A>
  • paths: <A>alpha ; <A><B>beta ; <A><B><C>gamma
  • encoded: A alpha ; A B beta ; A B C gamma
  • infix (not common): A alpha B beta C gamma
• Convert path to string for fast searches
• Replace tags with ‘non-terminal’ characters (like in automata)
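The prefix encoding from the slide can be reproduced with a short sketch. It assumes that the tag names themselves serve as the designator characters and that a sorted list with binary search stands in for the real index structure; neither assumption comes from the Index Fabric paper.

```python
import bisect
import xml.etree.ElementTree as ET


def encode_paths(element, prefix=()):
    """Yield one 'designators + value' key per element that carries text."""
    designators = prefix + (element.tag,)  # assumption: tag name = designator
    text = (element.text or "").strip()
    if text:
        yield " ".join(designators + (text,))
    for child in element:
        yield from encode_paths(child, designators)


def prefix_search(sorted_keys, prefix):
    """Return all keys starting with prefix, using binary search on a sorted list."""
    start = bisect.bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break
        matches.append(key)
    return matches


root = ET.fromstring("<A>alpha<B>beta<C>gamma</C></B></A>")
encoded = list(encode_paths(root))
keys = sorted(encoded)
below_ab = prefix_search(keys, "A B ")
```

Once every path is a string, a path query such as "everything under A/B" reduces to an ordinary prefix search over the keys.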
Index Fabric - raw paths
[Figure: raw paths. Source: “A Fast Index for Semistructured Data” - Brian F. Cooper et al.]
Graphs - how to index?
• Backbone
[Figure: graph image from http://www.aisee.com/]
Graphs - how to index?
• Tree-type - prefixes - tries
[Figure: graph image from http://www.aisee.com/]
Graphs - how to index?
• Path templates: 1-index, 2-index, T-index
“Index Structures for Path Expressions” - Tova Milo, Dan Suciu
Graphs - how to index?
• Landmarks
[Figure: graph image from http://www.aisee.com/]
Indexing - summary
• Indexing semistructured data
  • index fabric - encoding, multilayered
  • common prefixes - trie structure
  • backbone - highways between points
  • landmarks - county division
  • path templates - precalculated expressions
  • clustering - grouping by theme access
• Indexing such data is NOT easy; the solution depends on how you want to search the graph
References
• “Efficient RDF Storage and Retrieval in Jena2” - Kevin Wilkinson, Craig Sayers, Harumi Kuno, Dave Reynolds
• “A Fast Index for Semistructured Data” - Brian F. Cooper, Neal Sample, Michael J. Franklin, Gisli Hjaltason, Moshe Shadmon
• “Index Structures for Path Expressions” - Tova Milo, Dan Suciu