Principles of Ontology Storage, Reasoning and Query

Principles of Ontology Storage, Reasoning and Query 马力 MALLI@cn.ibm.com IBM China Research Lab

A Review of the Previous Class • What have we learned about triple stores? • What reasoning algorithms did you learn before? • What is SPARQL? • How does it work?

Objectives of This Class • Master methods to store RDF and OWL data • Learn optimization schemes for triple stores • Learn basic methods for ontology reasoning • Learn schemes to scale up a classic DL reasoner • Learn a method to extend DLs with constraints • Understand the semantics and complexity of SPARQL • Learn a relational algebra for SPARQL • Learn how these three parts work together in a commercial system

Outline • Ontology Storage • Dimension table • Jena’s property table • Binary store vs Improved Generic Store • Ontology Reasoning • Rule Inference • Forward chaining • Backward chaining • An efficient scheme to scale up a classic DL reasoner • A method to extend DLs with constraints • Ontology Query • Semantics and Complexity of SPARQL • Relational algebra for SPARQL • A Complete System: Oracle RDF Store

Triple Store: RDF Data Model • Data model for expressing knowledge • Basic building block: the statement, e.g. <person001> <name> “Jeen” . • Groups of statements form graphs • [Figure: example RDF graph with nodes person001, project001, Jeen, j.broekstra@tue.nl and SOR, and edges name, email, worksIn and projectMemberEmail]

Triple Store: RDFS • RDF Schema is a Vocabulary Description Language • It allows specification of a domain vocabulary and a way to structure it • Class, Property, subClassOf, subPropertyOf, domain, range • Formal semantics add simple reasoning capabilities: • class and property subsumption • domain and range inference • [Figure: example RDFS graph with nodes rdfs:Class, rdf:Property, Person, Researcher, name and person001, and edges rdf:type, rdfs:domain and rdfs:subClassOf]
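The two inference kinds named on this slide can be sketched in a few lines of Python. This is only a toy closure over rdfs:subClassOf and rdfs:domain, not a full RDFS reasoner. The schema triples are an assumed reading of the slide's figure, and person002 is a hypothetical individual added to exercise the domain rule.

```python
def rdfs_closure(triples):
    """Toy RDFS closure: handles only rdfs:subClassOf and rdfs:domain."""
    inferred = set(triples)
    while True:
        new = set()
        for s, p, o in inferred:
            for s2, p2, o2 in inferred:
                # Class subsumption: x type C, C subClassOf D  =>  x type D
                if p == "rdf:type" and p2 == "rdfs:subClassOf" and s2 == o:
                    new.add((s, "rdf:type", o2))
                # Domain inference: x p y, p domain C  =>  x type C
                if p2 == "rdfs:domain" and s2 == p:
                    new.add((s, "rdf:type", o2))
        if new <= inferred:          # fixpoint reached, nothing new derived
            return inferred
        inferred |= new


graph = {
    ("Researcher", "rdfs:subClassOf", "Person"),   # assumed from the figure
    ("name", "rdfs:domain", "Person"),             # assumed from the figure
    ("person001", "rdf:type", "Researcher"),
    ("person002", "name", "Ann"),                  # hypothetical individual
}
closure = rdfs_closure(graph)
```

After the closure, person001 is typed as Person through subsumption, and person002 is typed as Person through the domain of name.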

Triple Store: Generic Storage Model • A Triples table of three relational columns (Subject, Property, Object) • A Resource table to encode all URIs • Features: • efficient in space • retrieval requires a 3-way join • [Figure: Triples table and Symbols table]
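A minimal sketch of this layout using Python's built-in sqlite3 module. The table and column names are assumptions, literals are stored in the resource table for simplicity, and the sample triples are illustrative. The final query is one reading of the "3-way join": each of the three ID columns has to be joined back to the resource table to recover readable values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE resource (id INTEGER PRIMARY KEY, uri TEXT UNIQUE);
CREATE TABLE triples  (subject INTEGER, property INTEGER, object INTEGER);
""")


def rid(value):
    """Return the integer ID of a URI or literal, creating it if needed."""
    conn.execute("INSERT OR IGNORE INTO resource (uri) VALUES (?)", (value,))
    row = conn.execute("SELECT id FROM resource WHERE uri = ?", (value,))
    return row.fetchone()[0]


def add(s, p, o):
    conn.execute("INSERT INTO triples VALUES (?, ?, ?)", (rid(s), rid(p), rid(o)))


add("person001", "name", "Jeen")
add("person001", "worksIn", "project001")   # illustrative triple
add("project001", "name", "SOR")            # illustrative triple

# Decoding a triple joins the resource table three times.
rows = conn.execute("""
SELECT s.uri, p.uri, o.uri
FROM triples t
JOIN resource s ON s.id = t.subject
JOIN resource p ON p.id = t.property
JOIN resource o ON o.id = t.object
ORDER BY s.uri, p.uri
""").fetchall()
```

The triples table holds only integers, which is where the space efficiency comes from; the cost is paid at retrieval time.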

Triple Store: Improved Generic Storage Model • The RDFS and OWL ontology is stored separately from the instances. • Features • Efficient management of the ontology.

Triple Store: Binary Storage Model • Each class and property has a separate table. • Features • Higher performance by avoiding a huge triple table. • An ontology with hundreds of thousands of classes and properties creates overhead for the RDBMS.
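The trade-off can be illustrated with an in-memory stand-in, where a dictionary of lists plays the role of the per-property tables. The class name and sample data are hypothetical. A lookup on a known property touches one table, while a lookup with an unknown property has to visit every table.

```python
from collections import defaultdict


class BinaryStore:
    """Toy binary store: one (subject, object) table per property."""

    def __init__(self):
        self.tables = defaultdict(list)   # property -> [(subject, object)]

    def add(self, s, p, o):
        self.tables[p].append((s, o))

    def by_property(self, p):
        # Only the table of property p is read.
        return list(self.tables.get(p, []))

    def about(self, s):
        # Property unknown: every table has to be scanned.
        return [(p, o) for p, rows in self.tables.items()
                for subj, o in rows if subj == s]


store = BinaryStore()
store.add("item1", "hasSupplier", "MUR")
store.add("item2", "hasSupplier", "ACME")
store.add("item1", "PartKey", "34389")
```

With hundreds of thousands of properties, the number of tables is what burdens the RDBMS, which is the overhead the slide refers to.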

An Example: IBM SOR’s Schema

Jena’s Property Table • A table that stores patterns of RDF statements • An n-column property table stores n-1 statements per row (one column per property) • Augments, doesn’t replace, the triple store (TS) • Partitioned statements: a statement is stored in the TS or a property table, never both • Partitioned properties: all values for a given property are in the TS or a property table
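The partitioning rules can be sketched as follows. This is not Jena's API: the class HybridStore and its methods are hypothetical, and the property table here is single-valued, so a second value for the same property overwrites the first.

```python
class HybridStore:
    """Toy triple store augmented with one single-valued property table."""

    def __init__(self, table_properties):
        self.columns = set(table_properties)   # properties held in the table
        self.prop_table = {}                   # subject -> {property: value}
        self.triples = set()                   # everything else

    def add(self, s, p, o):
        # Partitioned properties: a property lives in exactly one place.
        if p in self.columns:
            self.prop_table.setdefault(s, {})[p] = o
        else:
            self.triples.add((s, p, o))

    def statements(self, s):
        # Applications see one set of statements, wherever they are stored.
        row = self.prop_table.get(s, {})
        from_table = {(s, p, o) for p, o in row.items()}
        from_ts = {t for t in self.triples if t[0] == s}
        return from_table | from_ts


store = HybridStore(["name", "email"])
store.add("person001", "name", "Jeen")
store.add("person001", "email", "j.broekstra@tue.nl")
store.add("person001", "worksIn", "project001")
```

The row for person001 has three columns (subject, name, email) and therefore carries two statements, matching the "n columns, n-1 statements" rule.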

Triple Store plus Property Tables • [Figure: a triple store only, compared with a Person property table alongside a triple store]

Property Table • Advantages • efficiencies in storage and access • transparent to the application • enables access to legacy relational tables (which can be modeled as property tables) • bridges the RDF-relational divide • Disadvantages • exhaustive search if the property is unknown, e.g. (“Tony Blair”, -, -) • queries don’t compile to a single SQL statement • loss of flexibility: fixed schema, typed property values

Creating Property Tables • Types of property tables • Column encoding • Table specification

Types of Property Tables • Single-valued property table: stores several single-valued properties for a subject; columns subject, obj1, obj2, …, objn • Multi-valued property table: stores one multi-valued property for a subject; columns subject, object • Property-class table: stores class membership (rdf:type), the only property allowed in multiple tables; columns subject, obj1, obj2, …, objn plus a type column

Property Table Column Encoding • Issue: how to encode values in columns? • Option 1: Jena encoding or symbol IDs, e.g. “::en:http://www.hp.com/ex#foo” or 1234 • Option 2: native database encoding, e.g. “foo” • Choice: support both; Option 2 is needed to access legacy database tables

Property Table Creation • Property tables are • user-defined • sharable across graphs • created when the graph is created • specified in a meta-graph (RDF statements): table name, type, column descriptors, etc.

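Multi-Dimensional Clustering • Indexes on regular tables are record-based, and any clustering of an index is restricted to a single dimension. Prior to Version 8, DB2(R) Universal Database (DB2 UDB) supported only single-dimensional clustering of data, through clustering indexes. Using a clustering index, DB2 UDB attempts to maintain the physical order of data on pages in the key order of the index as records are inserted into and updated in the table. Clustering indexes greatly improve the performance of range queries whose predicates contain one or more keys of the clustering index. A good clustering index improves performance because only a portion of the table needs to be accessed and prefetching can be performed more efficiently.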

Clustering Index

Block Index

Comparison

Block Identifier

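Advantages • Probing and scanning block indexes is much faster, because block indexes are very small compared with record-based indexes • Block indexes and the corresponding organization of data allow more fine-grained "partition elimination", that is, selective table access • Queries that use block indexes benefit from the reduced index size, optimized prefetching of blocks, and guaranteed clustering of the corresponding data • Locking and predicate evaluation can be reduced for some queries • Block indexes incur very little logging and maintenance overhead, because they need to be updated only when the first record is added to a block or the last record is removed from a block • Data that is rolled in can reuse the contiguous space left by data that was previously rolled out.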
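The first and fifth points can be illustrated with a toy model. This is not DB2's implementation: the class ToyMDCTable, the block size of 4 and the sample rows are all hypothetical, and deletion is not modelled. Rows with the same combination of dimension values share blocks, and each dimension's block index points at blocks rather than at records.

```python
from collections import defaultdict

BLOCK_SIZE = 4  # records per block (hypothetical, tiny for illustration)


class ToyMDCTable:
    """Toy multi-dimensional clustering: rows are grouped into blocks by cell."""

    def __init__(self, dims):
        self.dims = tuple(dims)
        self.blocks = []                        # list of (cell, rows)
        self.cell_blocks = defaultdict(list)    # cell -> block ids
        self.block_index = {d: defaultdict(set) for d in self.dims}
        self.index_updates = 0

    def insert(self, row):
        cell = tuple(row[d] for d in self.dims)
        for bid in self.cell_blocks[cell]:
            rows = self.blocks[bid][1]
            if len(rows) < BLOCK_SIZE:
                rows.append(row)    # block already indexed: no index update
                return
        # First record of a new block: only now are the block indexes touched.
        bid = len(self.blocks)
        self.blocks.append((cell, [row]))
        self.cell_blocks[cell].append(bid)
        for d, v in zip(self.dims, cell):
            self.block_index[d][v].add(bid)
        self.index_updates += 1

    def scan(self, **conds):
        # Intersect block ids per dimension, then read only those blocks.
        bids = set(range(len(self.blocks)))
        for d, v in conds.items():
            bids &= self.block_index[d].get(v, set())
        return [r for bid in sorted(bids) for r in self.blocks[bid][1]]


table = ToyMDCTable(["supplier", "category"])
for i in range(10):
    table.insert({"id": i, "supplier": "MUR", "category": "C1"})
for i in range(10, 13):
    table.insert({"id": i, "supplier": "MUR", "category": "C2"})
for i in range(13, 15):
    table.insert({"id": i, "supplier": "ACME", "category": "C1"})
```

Fifteen records end up in five blocks, so the block indexes are touched five times rather than fifteen, and a query on both dimensions reads a single block.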

Storage Experiments • Experiments on the Avnet data set • TBox: 117 classes, 232 properties (179 datatype properties, 53 object properties) • ABox: • First experiments performed on 1 million products (25 million triples) • Improve vertical storage • Leverage MDC (Multidimensional Clustering) of DB2 to reduce I/O cost and improve performance • Choose the best indexes • Use hash codes to improve string search • Add binary tables for searchable attributes • Desktop computer, 2.66 GHz, 2 GB memory • New version of SOR tested with 4.2 million products, represented by 120 million triples • IBM Blade Server, Intel Xeon 2.8 GHz, 4 GB memory

Storage Experiments: Queries • Queries on properties • Q1: Retrieve one item according to its ID • SELECT * WHERE {?x test:PartKey '34389'} • Q2: Retrieve all items made by a given supplier • SELECT ?x WHERE {?x test:hasSupplier test:MUR} • Queries involving relationships • Q3: Retrieve all items whose parent item is made by a given supplier • SELECT * WHERE {?x test:hasParent ?y . ?y test:hasSupplier test:MUR} • Q4: Q3 with SOSCode != 'S' • SELECT * WHERE {?x test:hasParent ?y . ?y test:hasSupplier test:MUR . ?x test:SOSCode ?b FILTER (?b != 'S')} • Queries where the predicate (property) is not specified • Q5: Retrieve all information about a given item • SELECT * WHERE {?x ?y ?z . ?x test:PartKey '34389'} • Q6: Retrieve all items and relationships to a given company • SELECT * WHERE {?x ?y test:MUR} • Queries on the class structure • Q7: Retrieve all items belonging to both category C1 and category C2 • SELECT ?x WHERE {?x a test:C1 . ?x a test:C2} • Q8: Q7 with manufacturer != supplier • SELECT ?x WHERE {?x a test:C1 . ?x a test:C2 . ?x test:hasSupplier ?y . ?x test:hasManufacturer ?z FILTER (?y != ?z)}
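To see why the relationship queries are more expensive than Q1 and Q2, here is a sketch of how Q3 could be evaluated over a plain triples table with sqlite3. The single-table schema and the sample items are assumptions for illustration and are not SOR's actual schema. Each triple pattern becomes one reference to the table, and the shared variable ?y becomes a join condition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", [
    ("test:item2", "test:hasParent", "test:item1"),     # hypothetical data
    ("test:item1", "test:hasSupplier", "test:MUR"),
    ("test:item3", "test:hasParent", "test:item4"),
    ("test:item4", "test:hasSupplier", "test:OTHER"),
])

# Q3: {?x test:hasParent ?y . ?y test:hasSupplier test:MUR}
# Two triple patterns -> two aliases of the table, joined on ?y.
q3 = conn.execute("""
SELECT t1.s AS x, t1.o AS y
FROM triples t1
JOIN triples t2 ON t1.o = t2.s
WHERE t1.p = 'test:hasParent'
  AND t2.p = 'test:hasSupplier'
  AND t2.o = 'test:MUR'
""").fetchall()
```

Q4 would add a third alias for the SOSCode pattern, so the number of self-joins grows with the number of triple patterns.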

Storage experiments: Value of dimensions • Little change on a simple query (Q1), which stays around 1 second • Similar results for simple queries (e.g. one triple) • [Figure: time (ms) against number of individuals, for normal tables and dimension tables]

Storage experiments: Value of dimensions (2) • Large improvement on a complex query with several joins (here, the example of Q4) • [Figure: time (ms) against number of individuals, for normal tables and dimension tables]

Impacts of binary tables • Define one table for every searchable attribute • (SubjectID, ObjectID, ObjectValue) • Binary tables are very efficient when the query involves only one property. • Binary tables are also efficient when selectivity is high. • The dimensioned vertical table scales when the number of joins increases.

Storage experiments: significant improvement

Summary and Questions • Optimization of Triple Stores: • Jena’s property table • Multi-Dimensional clustering • Binary store vs Generic Store

Outline • Ontology Storage • Dimension table • Jena’s property table • Binary store vs Improved Generic Store • Ontology Reasoning • Rule Inference • Forward chaining • Backward chaining • An efficient scheme to scale up a classic DL reasoner • A method to extend DLs with constraints • Ontology Query • Semantics and Complexity of SPARQL • Relational algebra for SPARQL • A Complete System: Oracle RDF Store

Rules and Inference • Rules are really statements in logic • About what we believe to be true • About what should occur • Form • IF antecedent THEN conclusion • IF condition THEN action • IF antecedent THEN goal • Interpreters • Backward chaining • Trigger on conclusion/goal • Forward chaining • Trigger on antecedent/condition

Forward Chaining • Rules • R1: IF may_rain THEN should_take_umbrella • R2: IF cloudy THEN may_rain • “What should I do if it is cloudy?” • “What do the rules indicate I should do if it is cloudy?” • Is there a rule that applies when it is cloudy? • R2: antecedent: cloudy • What do I conclude from that antecedent, ‘cloudy’ • R2: conclusion: may_rain • Is there a rule that applies when it may_rain? • R1: antecedent: may_rain • What do I conclude from that antecedent: ‘may_rain’ • R1: conclusion: should_take_umbrella

Forward Chaining • [Figure: rules applied to facts Fact_1 to Fact_4 to reach Action=Fact_5, with the direction of reasoning marked]

Forward Chaining • Fire any rule whose premises are satisfied in the KB. • Add its conclusion to the KB, and repeat until the query is found.

Forward Chaining Example • P ⇒ Q • L ∧ M ⇒ P • B ∧ L ⇒ M • A ∧ P ⇒ L • A ∧ B ⇒ L • A • B
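A minimal sketch of forward chaining over this knowledge base. Rules are encoded as (premises, conclusion) pairs; the encoding and the function name are choices made for illustration. The loop is a naive fixpoint iteration rather than an optimized agenda-based algorithm.

```python
def forward_chain(rules, facts, query):
    """Fire rules whose premises are known until the query is derived."""
    known = set(facts)
    changed = True
    while changed:
        if query in known:
            return True
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and all(p in known for p in premises):
                known.add(conclusion)    # add the conclusion to the KB
                changed = True
    return query in known


rules = [
    (["P"], "Q"),
    (["L", "M"], "P"),
    (["B", "L"], "M"),
    (["A", "P"], "L"),
    (["A", "B"], "L"),
]
proved = forward_chain(rules, {"A", "B"}, "Q")
```

Starting from A and B, the facts are derived in the order L, M, P, Q. With only A known, no rule can fire and Q is not derived.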

Backward Chaining • Motivation: Need goal-directed reasoning in order to keep from getting overwhelmed with irrelevant consequences • Main idea: • Work backwards from query q • To prove q: • Check if q is known already • Prove by backward chaining all premises of some rule concluding q

Backward Chaining • Rules • R1: IF may_rain THEN should_take_umbrella • R2: IF cloudy THEN may_rain • “Should I take an umbrella?” • “Do the rules indicate I should take an umbrella?” • Is there a rule about “taking umbrellas”? • R1: goal: should_take_umbrella • How can I prove that goal? • What has to be true for R1 to hold? • may_rain is the antecedent of R1 • Can I prove that it may_rain? • R2: goal: may_rain • How can I prove that goal? • What has to be true for R2 to hold? • cloudy is the antecedent of R2 • How can I prove ‘cloudy’?

Backward Chaining • [Figure: chain of rules linking Goal_1, Goal_2, Goal_3, Goal_4 and Goal_5, with the question and the direction of reasoning marked]

Backward Chaining • Rules • R1: IF may_rain THEN should_take_umbrella • R2: IF cloudy THEN may_rain • R3: IF may_be_intense_sun THEN should_take_umbrella • R4: IF summer AND in_tropics THEN may_be_intense_sun • “Should I take an umbrella?” • “Do the rules indicate I should take an umbrella?” • Is there a rule about “taking umbrellas”? • R1: goal: should_take_umbrella • What is the antecedent of R1? • R1: antecedent: may_rain • Can I prove that it may_rain? • R2: goal: may_rain • How can I prove may_rain? • R2: antecedent: cloudy • BUT NOT CLOUDY!

Backward Chaining: Backtracking • Rules • R1: IF may_rain THEN should_take_umbrella • R2: IF cloudy THEN may_rain • R3: IF may_be_intense_sun THEN should_take_umbrella • R4: IF summer AND in_tropics THEN may_be_intense_sun • “Should I take an umbrella?” • Are there any other rules about umbrellas? • R3: goal: should_take_umbrella • What is the antecedent of R3? • R3: antecedent: may_be_intense_sun • Can I prove may_be_intense_sun? • R4: goal: may_be_intense_sun • R4: antecedent: summer AND in_tropics
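The umbrella walk-through can be traced with a small sketch. The dictionary encoding and the trace list are illustrative choices, and the fact set (summer and in_tropics hold, cloudy does not) is an assumption that matches the "not cloudy" scenario. There is no loop guard here because these four rules contain no cycle.

```python
RULES = {
    "R1": (["may_rain"], "should_take_umbrella"),
    "R2": (["cloudy"], "may_rain"),
    "R3": (["may_be_intense_sun"], "should_take_umbrella"),
    "R4": (["summer", "in_tropics"], "may_be_intense_sun"),
}


def prove(goal, facts, trace):
    """Backward chaining; trace records the order in which rules are tried."""
    if goal in facts:
        return True
    for name, (premises, conclusion) in RULES.items():
        if conclusion != goal:
            continue
        trace.append(name)
        if all(prove(p, facts, trace) for p in premises):
            return True
        # A premise failed: backtrack and try the next rule for this goal.
    return False


trace = []
result = prove("should_take_umbrella", {"summer", "in_tropics"}, trace)
```

R1 is tried first and fails through R2 because cloudy cannot be proved; the search then backtracks to R3, which succeeds through R4.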

Backward Chaining with Backtracking • [Figure: two chains of rules, one through Goal_1, Goal_2, Goal_3 and Goal_4 and one through Goal_8, Goal_7, Goal_6 and Goal_4, both continuing to Goal_5; one chain is marked "fail", and the question and the direction of reasoning are marked]

Backward Chaining Example • P ⇒ Q • L ∧ M ⇒ P • B ∧ L ⇒ M • A ∧ P ⇒ L • A ∧ B ⇒ L • A • B
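This knowledge base contains a cycle (P depends on L through L ∧ M ⇒ P, and L depends on P through A ∧ P ⇒ L), so a naive backward chainer could recurse forever. The sketch below adds a simple guard: a goal that is already being proved on the current path is treated as failed. The encoding and the guard are illustrative choices, not the only way to handle loops.

```python
def backward_chain(rules, facts, goal, visiting=frozenset()):
    """Prove goal by working backwards from it; visiting guards against loops."""
    if goal in facts:
        return True                  # the goal is already known
    if goal in visiting:
        return False                 # goal is already on the current path
    visiting = visiting | {goal}
    for premises, conclusion in rules:
        if conclusion == goal and all(
            backward_chain(rules, facts, p, visiting) for p in premises
        ):
            return True
    return False


rules = [
    (["P"], "Q"),
    (["L", "M"], "P"),
    (["B", "L"], "M"),
    (["A", "P"], "L"),
    (["A", "B"], "L"),
]
proved = backward_chain(rules, {"A", "B"}, "Q")
```

To prove Q the search needs P, then L and M. The rule A ∧ P ⇒ L is rejected because P is already on the path, and L is proved through A ∧ B ⇒ L instead.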
