WP3: Data Provenance and Access Control

Irini Fundulaki, FORTH
December 11-12, 2012, Luxembourg

Extended example scenario
[Figure: architecture diagram. Recoverable labels: GADM-RDF, GeoSpecies; provenance and access control; quality control; SSN ontologies; registry; C-SPARQL/SPARQL-STR/HTTP; relational DBMS; stream DB; RDF (stream) DB; CSV; Twitter]

First Year Review Comments
• Collaboration
  • Provenance and Data Quality (with UPM, FUB for WP2)
  • Collaboration with KIT for a privacy-aware access control framework
• Real versus synthetic data
  • PlanetData datasets (GADM-RDF¹, GeoSpecies², enhanced with links from DBPedia using LDSpider), GO³ and CIDOC⁴ ontologies
• Scalability
  • Real datasets with up to 11M triples
¹ GADM-RDF: http://gadm.geovocab.org/
² GeoSpecies: http://lod.geospecies.org/
³ Gene Ontology: http://www.geneontology.org/
⁴ CIDOC-CRM: http://www.cidoc-crm.org/

Access Control
• Access control:
  • Refers to the ability to allow or deny the use of a particular resource by a particular entity
  • Crucial for sensitive content, since it ensures the selective exposure of information to different classes of users
• Access control enforcement techniques for RDF data in the presence of RDFS inference:
  • Materialization (forward chaining)
  • Query rewriting (backward chaining)
  • Static analysis

Access Control Models
• Standard access control models associate a concrete label with each triple to indicate whether the triple can be accessed
• Built-in rule: an implied RDF triple can be accessed if and only if all its implying triples can be accessed

Explicit triples:
  s        p     o        label
  &a       type  Student  yes
  Student  sc    Person   yes

Implied triple:
  s        p     o        label
  &a       type  Person   yes

Access Control Models
• An implied RDF triple can be accessed if and only if all its implying triples can be accessed
• After any kind of update, the implied triples and their labels must be re-computed
• The overhead can be substantial in large datasets when updates occur frequently

Explicit triples (label before → after the update):
  s        p     o        label
  &a       type  Student  yes
  Student  sc    Person   yes → no

Implied triple (label before → after the update):
  s        p     o        label
  &a       type  Person   yes → no
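The recomputation cost described above can be illustrated with a minimal Python sketch. It assumes a hypothetical in-memory representation: the names `explicit_labels`, `implied_by` and `recompute_implied` are illustrative only and are not part of the presented system.

```python
# Hypothetical in-memory model of concrete yes/no labels (illustrative only).
explicit_labels = {
    ("&a", "type", "Student"): True,
    ("Student", "sc", "Person"): True,
}

# Each implied triple lists the explicit triples that imply it.
implied_by = {
    ("&a", "type", "Person"): [
        ("&a", "type", "Student"),
        ("Student", "sc", "Person"),
    ],
}


def recompute_implied(explicit_labels, implied_by):
    """Built-in rule: an implied triple is accessible iff all its implying triples are."""
    return {
        triple: all(explicit_labels[source] for source in sources)
        for triple, sources in implied_by.items()
    }


before = recompute_implied(explicit_labels, implied_by)

# An update flips one explicit label; every implied label has to be recomputed.
explicit_labels[("Student", "sc", "Person")] = False
after = recompute_implied(explicit_labels, implied_by)
```

Because only the final yes/no value is stored, nothing records which explicit triples contributed to an implied label, so the sketch has to recompute all implied labels after each update.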

Our Approach
• Fine-grained abstract access control model comprised of abstract tokens and operators
  • focus on the RDF triple level
  • focus on permissions for read-only queries
  • labeled triples are encoded as quadruples
  • supports RDFS inference and propagation of labels across the RDFS hierarchies
  • encodes how the access label of an implied quadruple is computed
  • concrete policies are used to assign specific values to the abstract tokens and operators
• Implementation of a fine-grained, repository-independent access control framework, portable across platforms, on top of the MonetDB column store

Annotating Triples with Access Tokens
• l1, l2: abstract tokens
       s        p      o        label
  q1:  &a       type   Student  l1
  q2:  Student  sc     Person   l2
  q3:  Person   type   class    l3
• ⊙: entailment operator; records the triples used in the implication
       s        p      o        label
  q4:  &a       type   Person   l1 ⊙ l2
  q5:  Student  type   class    l2 ⊙ l3
• ⊗: propagation operator
       s        p      o        label
  q6:  &a       type   Person   ⊗ l3
• ⊥: default access token
       s        p      o        label
  q7:  &a       fname  Alice    ⊥

Annotating Triples with Access Tokens
• Authorization rules: authorizations are pairs (query, abstract token)
  A1: (construct {?x lname ?y} where {?x type Student}, l1)
  A2: (construct {?x sc ?y}, l2)
  A3: (construct {?x type Student}, l3)
  A4: (construct {?x type class}, l4)
  A5: (construct {Student sc ?y}, l5)
• RDF quadruples
       s        p      o        l
  q1:  Student  sc     Person   l2
  q2:  Person   sc     Agent    l2
  q3:  &a       type   Student  l3
  q4:  &a       fname  Alice    ⊥
  q5:  Agent    type   class    l4
  q6:  Student  sc     Person   l5
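A minimal Python sketch of how authorizations could turn triples into quadruples. The encoding of an authorization as a `(construct, where, token)` tuple, the helper names `match` and `annotate`, and the string `"default"` standing in for the default access token are all assumptions made for illustration; the slides do not show the actual implementation.

```python
def match(pattern, triple, binding=None):
    """Return variable bindings if the triple matches the pattern, else None."""
    binding = dict(binding or {})
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must bind consistently across the pattern.
            if binding.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return binding


def annotate(triples, authorizations, default="default"):
    """Attach to each triple the token of every authorization whose query returns it."""
    quads = []
    for triple in triples:
        tokens = []
        for construct, where, token in authorizations:
            binding = match(construct, triple)
            if binding is None:
                continue
            # The optional where clause must be satisfied by some triple in the dataset.
            if where is None or any(match(where, t, binding) is not None for t in triples):
                tokens.append(token)
        # Triples matched by no authorization receive the default token.
        for token in tokens or [default]:
            quads.append(triple + (token,))
    return quads


authorizations = [
    (("?x", "lname", "?y"), ("?x", "type", "Student"), "l1"),  # A1
    (("?x", "sc", "?y"), None, "l2"),                          # A2
    (("?x", "type", "Student"), None, "l3"),                   # A3
    (("?x", "type", "class"), None, "l4"),                     # A4
    (("Student", "sc", "?y"), None, "l5"),                     # A5
]
triples = [
    ("Student", "sc", "Person"),
    ("Person", "sc", "Agent"),
    ("&a", "type", "Student"),
    ("&a", "fname", "Alice"),
    ("Agent", "type", "class"),
]
quads = annotate(triples, authorizations)
```

In this sketch the triple (Student, sc, Person) is returned by both A2 and A5, so it appears twice, once with l2 and once with l5, which mirrors quadruples q1 and q6 on the slide.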

Computing the labels of implied triples
• RDFS inference: generates new knowledge
  • subClassOf: (A1, sc, A2, l1), (A2, sc, A3, l2) ⇒ (A1, sc, A3, l1 ⊙ l2)
  • typeOf: (&r1, type, A1, l1), (A1, sc, A2, l2) ⇒ (&r1, type, A2, l1 ⊙ l2)

Explicit RDF quadruples:
        s        p     o        l
  q1:   Student  sc    Person   l2
  q2:   Person   sc    Agent    l2
  q3:   &a       type  Student  l3
  q5:   Agent    type  class    l4
  q6:   Student  sc    Person   l5

Implied RDF quadruples:
        s        p     o        l
  q8:   Student  sc    Agent    l2 ⊙ l2
  q9:   Student  sc    Agent    l5 ⊙ l2
  q10:  &a       type  Person   l3 ⊙ l2
  q11:  &a       type  Agent    (l3 ⊙ l2) ⊙ l2
  q12:  &a       type  Agent    (l5 ⊙ l2) ⊙ l2
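The two inference rules can be sketched in Python as a single pass over the explicit quadruples. This is an illustration under stated assumptions: abstract expressions are encoded as nested tuples `("entail", l1, l2)`, the function name `infer_once` is hypothetical, and one pass is not the full RDFS closure (it derives every one-step consequence, not only the rows listed on the slide).

```python
def entail(l1, l2):
    """Record an application of the entailment operator as an abstract expression."""
    return ("entail", l1, l2)


def infer_once(quads):
    """Apply the subClassOf and typeOf rules once to every pair of quadruples."""
    derived = set()
    for (s1, p1, o1, l1) in quads:
        for (s2, p2, o2, l2) in quads:
            # Both rules chain a quadruple with a subclass quadruple on o1 == s2.
            if p2 != "sc" or o1 != s2:
                continue
            if p1 in ("sc", "type"):
                derived.add((s1, p1, o2, entail(l1, l2)))
    return derived


explicit = [
    ("Student", "sc", "Person", "l2"),  # q1
    ("Person", "sc", "Agent", "l2"),    # q2
    ("&a", "type", "Student", "l3"),    # q3
    ("Agent", "type", "class", "l4"),   # q5
    ("Student", "sc", "Person", "l5"),  # q6
]
derived = infer_once(explicit)
```

Keeping the label as an unevaluated expression is what lets a concrete policy be chosen later without repeating the inference step.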

Computing the propagated labels
• Propagating labels along the RDFS hierarchies: no new knowledge is created
  • (A1, type, class, l1), (A2, type, A1, l2) ⇒ (A2, type, A1, ⊗(l1))

RDF quadruples: q1 to q12 as on the previous slide, plus the propagated quadruple:
        s        p     o        l
  q13:  &a       type  Agent    ⊗ l4
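A minimal Python sketch of the propagation rule, assuming quadruples are plain tuples and a propagated label is encoded as the nested tuple `("prop", l)`. The function name `propagate` and this encoding are illustrative choices, not taken from the slides.

```python
def propagate(quads):
    """Propagate the label of each class declaration to the instances of that class."""
    class_labels = [(s, label) for (s, p, o, label) in quads
                    if p == "type" and o == "class"]
    propagated = set()
    for cls, cls_label in class_labels:
        for (s, p, o, _label) in quads:
            # (A1, type, class, l1), (A2, type, A1, l2) => (A2, type, A1, prop(l1))
            if p == "type" and o == cls:
                propagated.add((s, "type", cls, ("prop", cls_label)))
    return propagated


quads = [
    ("Agent", "type", "class", "l4"),                                     # q5
    ("&a", "type", "Agent", ("entail", ("entail", "l3", "l2"), "l2")),    # q11
]
result = propagate(quads)
```

The output contains a triple that already exists in the input, only with a different label, which is why propagation creates no new knowledge.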

From Abstract Models to Concrete Policies
• Concrete policies implement the abstract model:
  • Set of concrete tokens
  • Mapping from abstract to concrete tokens
  • Set of concrete operators that implement the abstract ones
  • Conflict resolution operator to resolve cases where the same triple is assigned different labels
  • Access function to decide when a triple is accessible

Concrete Policy C1
• Concrete tokens: LP = {true, false}
• Concrete operators:
  • Entailment operator ⊙:
      l1 ⊙ l2 = l1 ∧ l2   if l1 and l2 are different from ⊥
                l1        if l2 is ⊥
                ⊥         if l1, l2 are equal to ⊥
  • Propagation operator ⊗: identity
  • Conflict resolution operator ⊕, over the set S of labels assigned to the same triple:
      ⊕S = false   if the value false is in S
           true    if false does not belong in S
           ⊥       otherwise
• Access function: triples with label true are accessible; all others are inaccessible

Concrete Policy C1
• Mapping: l2 → true, l3 → true, l4 → true, l5 → false

        s        p     o        abstract label   concrete label
  q1:   Student  sc    Person   l2               true
  q2:   Person   sc    Agent    l2               true
  q3:   &a       type  Student  l3               true
  q5:   Agent    type  class    l4               true
  q6:   Student  sc    Person   l5               false
  q8:   Student  sc    Agent    l2 ⊙ l2          true
  q9:   Student  sc    Agent    l5 ⊙ l2          false
  q13:  &a       type  Agent    ⊗ l4             true
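The evaluation of abstract expressions under policy C1 can be sketched in Python as follows. Several points are assumptions rather than statements from the slides: the default token is represented by `None`, expressions are nested tuples `("entail", a, b)` and `("prop", a)`, the case where only the first operand of the entailment operator is the default token is treated symmetrically to the stated case, and conflict resolution returns true only when true is actually present in the set.

```python
BOTTOM = None  # stands in for the default access token

MAPPING = {"l2": True, "l3": True, "l4": True, "l5": False}


def entail(a, b):
    """Entailment operator of C1: conjunction, ignoring default tokens."""
    if a is BOTTOM and b is BOTTOM:
        return BOTTOM
    if b is BOTTOM:
        return a
    if a is BOTTOM:
        return b  # assumption: symmetric to the case stated on the slide
    return a and b


def evaluate(expr, mapping):
    """Evaluate an abstract expression to a concrete label."""
    if isinstance(expr, str):
        return mapping.get(expr, BOTTOM)
    op, *args = expr
    values = [evaluate(arg, mapping) for arg in args]
    if op == "entail":
        return entail(*values)
    if op == "prop":
        return values[0]  # propagation operator is the identity in C1
    raise ValueError(f"unknown operator: {op}")


def resolve(labels):
    """Conflict resolution: false wins; true if present and no false; else default."""
    labels = set(labels)
    if False in labels:
        return False
    if True in labels:
        return True
    return BOTTOM


def accessible(label):
    """Access function: only the label true grants access."""
    return label is True


q8 = evaluate(("entail", "l2", "l2"), MAPPING)
q9 = evaluate(("entail", "l5", "l2"), MAPPING)
q13 = evaluate(("prop", "l4"), MAPPING)
student_sc_agent = resolve([q8, q9])
```

Since q8 and q9 label the same triple (Student, sc, Agent) with true and false respectively, conflict resolution in this sketch makes that triple inaccessible.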

Evaluation
• Implementation:
  • Relational schema:
    • Authorization(auth_id, query, access_token)
    • ExplicitTriples(qid, s, p, o, auth_id): stores the explicit triples
    • InferopTriples(qid, s, p, o): stores the implicit quadruples
    • PropOp(qid, s, p, o, l): stores the propagated quadruples
    • LabelStore(qid, qid_uses): stores, for each implied and propagated quadruple, the explicit quadruples that are used in its implication
  • Query engine: CWI's MonetDB open source column store
  • Stored procedures are used for the computation of the access labels
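A minimal sketch of the relational schema, written in Python against the standard-library `sqlite3` module as a stand-in for MonetDB. The table and column names come from the slide; the column types, the sample rows and the lookup query are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE "Authorization" (auth_id INTEGER PRIMARY KEY, "query" TEXT, access_token TEXT);
CREATE TABLE ExplicitTriples (qid INTEGER PRIMARY KEY, s TEXT, p TEXT, o TEXT, auth_id INTEGER);
CREATE TABLE InferopTriples (qid INTEGER PRIMARY KEY, s TEXT, p TEXT, o TEXT);
CREATE TABLE PropOp (qid INTEGER PRIMARY KEY, s TEXT, p TEXT, o TEXT, l TEXT);
CREATE TABLE LabelStore (qid INTEGER, qid_uses INTEGER);
""")

# Sample rows (hypothetical identifiers): two explicit subclass triples and one implied triple.
conn.execute('INSERT INTO "Authorization" VALUES (2, ?, ?)', ("construct {?x sc ?y}", "l2"))
conn.executemany(
    "INSERT INTO ExplicitTriples VALUES (?, ?, ?, ?, ?)",
    [(1, "Student", "sc", "Person", 2), (2, "Person", "sc", "Agent", 2)],
)
conn.execute("INSERT INTO InferopTriples VALUES (8, 'Student', 'sc', 'Agent')")
conn.executemany("INSERT INTO LabelStore VALUES (?, ?)", [(8, 1), (8, 2)])

# Look up the explicit quadruples, and their abstract tokens, behind implied quadruple 8.
rows = conn.execute("""
    SELECT e.qid, a.access_token
    FROM LabelStore ls
    JOIN ExplicitTriples e ON e.qid = ls.qid_uses
    JOIN "Authorization" a ON a.auth_id = e.auth_id
    WHERE ls.qid = 8
    ORDER BY e.qid
""").fetchall()
```

The lookup shows the role of LabelStore: the abstract label of an implied quadruple is not stored as a value but can be reassembled from the explicit quadruples it depends on.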

Benchmark
• Datasets:
  • Synthetic datasets: produced by PowerGen [1]
  • Real datasets: CIDOC, GO, GeoSpecies and GADM
• Authorizations:
  • 15 authorizations
• Concrete policy:
  • Access tokens: {low, medium, high}, with low < medium < high
  • Mapping:
    • the same triple is assigned different labels, and the same label is assigned to different triples
    • 5 authorizations assign the low label to 35% of the triples
    • 5 authorizations assign the medium label to 55% of the triples
    • 5 authorizations assign the high label to 45% of the triples
  • Operators:
    • Entailment, conflict resolution: min(); propagation: identity
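The benchmark policy's operators can be sketched in a few lines of Python. The ordering low < medium < high and the use of min() come from the slide; the names `RANK`, `lowest`, `entail`, `resolve` and `propagate` are illustrative.

```python
# Total order on the concrete tokens: low < medium < high.
RANK = {"low": 0, "medium": 1, "high": 2}


def lowest(labels):
    """Return the minimum label with respect to low < medium < high."""
    return min(labels, key=RANK.__getitem__)


def entail(l1, l2):
    """Entailment operator of the benchmark policy: min()."""
    return lowest([l1, l2])


def resolve(labels):
    """Conflict resolution operator of the benchmark policy: min()."""
    return lowest(labels)


def propagate(label):
    """Propagation operator of the benchmark policy: identity."""
    return label


implied = entail("high", "low")
resolved = resolve(["medium", "high", "medium"])
```

An explicit rank table is used here because comparing the token strings directly would order them alphabetically, which is not the intended order.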

Experiments
• Experiment 1: Annotation Time
  • the time required to compute the implied quadruples and the propagated labels
• Experiment 2: Evaluation Time
  • the time needed to compute, for a concrete policy, the concrete access label of a percentage of the RDF triples

Real Datasets
• Characteristics
• Experiment 1: Annotation Time
[Table and chart: not recoverable from the extracted text]

Real Datasets: Experiment 2

Synthetic Datasets: Experiment 1
• Annotation time does not depend on the number of explicit triples (i.e., the initial triples in the dataset) but on the number of implied quadruples produced

Synthetic Datasets: Experiment 2
• Characteristics
• Observations
  • Dataset D2 is more complex than D1 (due to its larger depth), leading to more complex abstract expressions

Pros & Cons of Abstract Access Control Models
• Pros:
  • The same application can experiment with different concrete policies over the same dataset
    • e.g., liberal vs. conservative policies for different classes of users
  • Different applications can experiment with different concrete policies for the same data
  • In the case of updates, there is no need to re-compute the access labels of the inferred triples
• Cons:
  • overhead in the required storage space
  • expressions that describe how a label is computed can become quite complex, depending on the structure of the dataset

WP3: Work Plan View
[Figure: Gantt chart; timeline axis in months: 0, 6, 12, 18, 24, 30, 36, 42]
• Task 3.1 Provenance Management (FORTH)
  • D 3.2 Provenance management and propagation through SPARQL query and update languages
• Task 3.2 Privacy, DRM and Access Control (FORTH)
  • D 3.1 Access control specification language, reasoning and enforcement mechanisms
  • D 3.3 Access control system and privacy-aware language
• Task 3.3 Trust management (EPFL)
  • D 3.4 Trust management and inference system

BackUp Slides
