Storage Engine for Semantic Web

Storage Engine for Semantic Web

Assertion • Storage engine for semantic web has requirements similar to those for e-commerce aplications. • Draw upon results and lessons from • R. Agrawal, A. Somani, Y. Xu: Storage and Retrieval of E-Commerce Data. VLDB-2001.

Typical E-Commerce Data Characteristics An Experimental E-marketplace for Computer components • Constantly evolving schema • Sparsely populated data (about 50-100 attributes/component) • Nearly 2 Million components • More than 2000 leaf-level categories • Large number of Attributes (5000)

Alternative Physical Representations • Horizontal • One N-ary relation • Binary • N 2-ary relations • Vertical • One 3-ary relation

Conventional horizontal representation (n-ary relation) • DB Catalogs do not support thousands of columns (DB2/Oracle limit: 1012 columns) • Storage overhead of NULL values Nulls increase the index size and they sort high in DB2 B+ tree index • Hard to load/update • Schema evolution is expensive • Querying is straightforward

Monitor Name Val Height Output PAN DVD-L75 7 inch Name Val Name Val 3.75 KLH DVD221 Digital PAN DVD-L75 KLH DVD221 S-Video Binary Representation(N 2-ary relations) • Dense representation • Manageability is hard because of large number of tables • Schema evolution expensive • Decomposition Storage Model [Copeland et al SIGMOD 85], [Khoshafian et al ICDE 87] • Monet: Binary Attribute Tables [Boncz et al VLDB Journal 99] • Attribute Approach for storing XML Data [Florescu et al INRIA Tech Report 99]

Oid (object identifier) Key (attribute name) Val (attribute value) Vertical representation(One 3-ary relation) • Objects can have large number of attributes • Handles sparseness well • Schema evolution is easy • Implementation of SchemaSQL [LSS 99] • Edge Approach for storing XML Data [FK 99]

Querying over Vertical Representation is Complex • Simple query on a Horizontal scheme SELECT MONITOR FROM H WHERE OUTPUT=‘Digital’Becomes quite complex: SELECT v1.Val FROM vtable v1, vtable v2 • WHERE v1.Key = ‘Monitor’ • AND v2.Key = ‘Output’ • AND v2.Val = ‘Digital’ • AND v1.Oid = v2.Oid Writing applications becomes much harder. What can we do ?

Horizontal view (H) Attr1 Attr2 … Attrk … Query Mapping Layer Vertical table (V) Oid Key Val Solution • Provide horizontal view of the vertical table • Translation layer automatically maps operations on H to operations on V

Transformation Algebra • Defined an algebra for transforming expressions over horizontal views into expressions over the vertical representation. • Two key operators: • v2h () • h2v ()

Sample Algebraic Transforms • v2h () Operation – Convert from vertical to horizontal k(V) = [Oid(V)]  [i=1,kOid,Val(Key=‘Ai’(V))] • h2V () Operation – Convert from horizontal to vertical k(H) = [i=1,kOid,’Ai’Ai(Ai  ‘’(V))]  [i=1,k Oid,’Ai’Ai(i=1,kAi=‘’(V)) • Similar operations such as Unfold/Fold and Gather/Scatter exist in SchemaSQL [LSS 99] and [STA 98] respectively • Complete transforms in VLDB-2001 Paper

From the Algebra to SQL • Equivalent SQL transforms for algebraic transforms • Select, Project • Joins (self, two verticals, a horizontal and a vertical) • Cartesian Product • Union, Intersection, Set difference • Aggregation • Extend DDL to provide the Horizontal View CREATE HORIZONTAL VIEW hview ON VERTICAL TABLE vtable USING COLUMNS (Attr1, Attr2, … Attrk, …)

Alternative Implementation Strategies • VerticalSQL • Uses only SQL-92 level capabilities • VerticalUDF • Exploits User Defined Functions and Table Functions to provide a direct implementation • Binary (hand-coded queries) • 2-ary representation with one relation per attribute (using only SQL-92 transforms)

density = 10%, 1000 cols x 20K rows 25 20 VerticalSQL_oid 15 Execution time (seconds) VerticalSQL_key 10 5 0 0.1% 1% 5% Join selectivity Data Organization Matters: Clustering by Key significantly outperforms by Oid Join

density = 10% 60 50 40 HorizontalSQL Execution time (seconds) 30 VerticalSQL 20 Binary 10 0 200x100K 400x50K 800x25K 1000x20K Table (#cols x #rows) • VerticalSQL comparable to Binary and outperforms Horizontal Projection of 10 columns

30 20 VerticalSQL Execution time (seconds) Binary 10 VerticalUDF 0 200x100K 400x50K 800x25K 1000x20K Table (#cols x #rows) • VerticalUDF is the best approach density = 10% Projection of 10 columns

Summary Vertical (w/ Mapping) Binary (w/ Mapping) Horizontal Manageability + + - Flexibility - + - + Querying + + - Performance + +

Remarks • Lessons of this study directly apply to building storage engine for semantics webs • Performance of vertical representation can be further improved by: • Enhanced table functions • First class treatment of table functions • Native support for v2h and h2v operations • Partial indices

Storage Engine for Semantic Web