Towards a Top-K SPARQL Query Benchmark Generator. Shima Zahmatkesh¹, Emanuele Della Valle¹, Daniele Dell'Aglio¹, and Alessandro Bozzon². ¹Politecnico di Milano, ²TU Delft
Agenda • Rankings, Rankings everywhere • What are top-k SPARQL queries • Jim Gray's Benchmarking Principles • The problem • Some Definitions • Research Hypothesis • Background work: DBpedia SPARQL Benchmark • Our proposal: Top-k DBPSB • Preliminary Evaluation • Conclusions
Why do we need to optimize top-k queries? A very intuitive and simplified example: the top 3 largest countries (by both area and population)
The standard way: the materialize-then-sort scheme. For all 242 countries, compute the scoring function that accounts for area and population, sort all 242 results, and only then fetch the 3 best results.
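To make the scheme concrete, here is a minimal Python sketch of materialize-then-sort; the country names and values are made-up, already-normalized numbers, not real DBpedia data.

```python
# Materialize-then-sort: score every row, sort the full list, then keep only k.
countries = [
    {"name": "A", "area": 0.90, "population": 0.80},   # values already in [0, 1]
    {"name": "B", "area": 0.70, "population": 0.95},
    {"name": "C", "area": 0.20, "population": 0.40},
    # ... in the slide's example there would be 242 such entries
]

def score(c):
    # Monotone scoring function accounting for both area and population.
    return 0.5 * c["area"] + 0.5 * c["population"]

top3 = sorted(countries, key=score, reverse=True)[:3]   # sorts everything first
print([c["name"] for c in top3])
```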
Innovative optimization: the split-and-interleave scheme. Use sorted access to the countries ordered by population, incrementally order the partial results by area, and fetch the 3 best results; only a small fraction of the 242 countries (9 in the example) needs to be accessed.
State-of-the-art database methods • Split the evaluation of the scoring function into single criteria • Interleave them with the other operators • Use partial orders to incrementally construct the final order • Standard assumptions: monotone scoring function; each criterion is evaluated as a value in [0,1] (normalization); optimized for the case of fast sorted access to each criterion
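The slides do not name a specific algorithm, so the following Python sketch only illustrates the general split-and-interleave idea in the style of Fagin's Threshold Algorithm: each criterion is scanned in its own sorted order, partial scores are combined incrementally, and the scan stops early once the top-k result can no longer change.

```python
import heapq

def split_and_interleave_top_k(by_area, by_pop, area, pop, k=3):
    """by_area / by_pop: item ids sorted by descending criterion value (sorted access).
    area / pop: dicts mapping id -> normalized [0,1] value (random access)."""
    f = lambda a, p: 0.5 * a + 0.5 * p            # monotone scoring function
    seen, top = set(), []                          # top: min-heap of (score, id)
    for id_a, id_p in zip(by_area, by_pop):        # one sorted access per criterion
        for item in (id_a, id_p):
            if item not in seen:
                seen.add(item)
                s = f(area[item], pop[item])       # random access for the other criterion
                heapq.heappush(top, (s, item))
                if len(top) > k:
                    heapq.heappop(top)
        threshold = f(area[id_a], pop[id_p])       # best score any unseen item can reach
        if len(top) == k and top[0][0] >= threshold:
            break                                  # early stop: no need to scan all items
    return sorted(top, reverse=True)

# Toy usage with three countries (values are illustrative):
area = {"A": 0.90, "B": 0.70, "C": 0.20}
pop = {"A": 0.80, "B": 0.95, "C": 0.40}
by_area = sorted(area, key=area.get, reverse=True)
by_pop = sorted(pop, key=pop.get, reverse=True)
print(split_and_interleave_top_k(by_area, by_pop, area, pop, k=2))
```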
Top-k SPARQL queries. E.g., the 10 most recent books written by the youngest authors:

SELECT ?book ?author (0.5*norm(?releaseDate) + 0.5*norm(?dateOfBirth) AS ?s)
WHERE {
  ?book dbp:isbn ?v .
  ?book dbp:author ?author .
  ?book dbp:releaseDate ?releaseDate .
  ?author dbp:dateOfBirth ?dateOfBirth .
}
ORDER BY DESC(?s) LIMIT 10

The scoring function appears as a SELECT expression; norm() casts each value into [0,1] as norm(x) = (x - min_x) / (max_x - min_x); ORDER BY and LIMIT order and slice the results.
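A small Python rendering of the normalization and the scoring function from this example; the years and the min/max bounds are invented toy values standing in for ?releaseDate and ?dateOfBirth, not DBpedia data.

```python
def norm(x, x_min, x_max):
    # norm(x) = (x - min_x) / (max_x - min_x): casts the value into [0, 1].
    return (x - x_min) / (x_max - x_min)

# Toy values: a book from 2012 by an author born in 1985 (bounds are invented).
release_year, birth_year = 2012, 1985
s = 0.5 * norm(release_year, 1900, 2012) + 0.5 * norm(birth_year, 1900, 1995)
print(round(s, 3))  # higher s = more recent book by a younger author
```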
The Problem. Set up a benchmark for top-k SPARQL queries that • Resembles reality • Stresses the features of top-k queries • Syntax: SELECT expression + ORDER BY + LIMIT • Performance: hits the SPARQL engine where it hurts
Jim Gray on Benchmarking: Principles • Relevant: measures performance and price/performance of systems when performing typical operations within the problem domain • Portable: easy to implement on many different systems • Scalable: applies to small and large computer systems • Simple: understandable
Definitions. E.g., the 10 most recent books written by the youngest authors • Scoring function: 0.5*norm(?releaseDate) + 0.5*norm(?birthDate) • Rankable data properties: releaseDate, dateOfBirth • Rankable triple patterns: the triple patterns containing those properties • Rankable variables: ?book, ?author • Scoring variables: ?releaseDate, ?birthDate
Research Hypotheses • H.0: top-k SPARQL queries that resemble reality can be obtained by extending the DBpedia SPARQL Benchmark • H.1: more rankable variables (++) imply longer execution time (++) • H.2: more scoring variables (++) imply longer execution time (++) • H.3: varying the LIMIT (+/-) leaves execution time unchanged (=)
DBpedia SPARQL Benchmark (DBPSB) • A method to generate a SPARQL benchmark from DBpedia and its query logs • It can be applied to other datasets and other query logs • Characteristics: resembles reality, stresses SPARQL features • Pipeline: dataset generation; query logs → query analysis and clustering → query templates → auxiliary queries → query instances
Proposed Solution: Top-k DBPSB • An extension of DBPSB: the auxiliary queries are enriched with top-k clauses, using the DBPSB dataset as the source of meaningful rankable variables • It is also a method: it can be applied to any other benchmark obtained with the DBPSB method • Pipeline: 1) find rankable variables, 2) compute max and min values, 3) generate the scoring function, 4) generate the top-k queries
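A minimal sketch, in Python, of how such a pipeline could be driven by filling in the auxiliary-query templates shown on the next slides. The function names and the plain string templating are assumptions made for illustration, not the authors' implementation; step 3 (building the scoring expression) is sketched separately after the step-3 slide.

```python
# Illustrative Top-k DBPSB driver: each step instantiates a SPARQL query template
# around the graph pattern of one DBPSB auxiliary query.
DBPSB_PATTERN = ("?v6 rdf:type ?v . ?v6 dbp:name ?v0 . ?v6 dbp:pages ?v1 . "
                 "?v6 dbp:isbn ?v2 . ?v6 dbp:author ?v3 .")

def step1_find_rankable(pattern=DBPSB_PATTERN):
    # Rank candidate properties by how often their objects are numeric or dateTime.
    return (f"SELECT ?p (COUNT(?p) AS ?n) WHERE {{ {pattern} ?v6 ?p ?o . "
            f"FILTER(isNumeric(?o) || datatype(?o) = xsd:dateTime) }} "
            f"GROUP BY ?p ORDER BY DESC(?n)")

def step2_min_max(prop, pattern=DBPSB_PATTERN):
    # Bounds of the chosen property, needed to normalize its values into [0, 1].
    return (f"SELECT (MAX(?o) AS ?max) (MIN(?o) AS ?min) WHERE {{ {pattern} "
            f"?v6 {prop} ?o . FILTER(isNumeric(?o) || datatype(?o) = xsd:dateTime) }}")

def step4_top_k(score_expr, extra_patterns, k, pattern=DBPSB_PATTERN):
    # Wrap the scoring expression (from step 3) into SELECT expression + ORDER BY + LIMIT.
    return (f"SELECT ?v6 ?v3 ({score_expr} AS ?s) WHERE {{ {pattern} {extra_patterns} }} "
            f"ORDER BY DESC(?s) LIMIT {k}")

# Example: the min/max auxiliary query for dbp:pages.
print(step2_min_max("dbp:pages"))
```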
A DBPSB Auxiliary Query

SELECT DISTINCT ?v
WHERE {
  ?v6 rdf:type ?v .
  ?v6 dbp:name ?v0 .
  ?v6 dbp:pages ?v1 .
  ?v6 dbp:isbn ?v2 .
  ?v6 dbp:author ?v3 .
}
Top-k DBPSB step 1a: to generate queries with 1 rankable variable

SELECT ?p (COUNT(?p) AS ?n)
WHERE {
  ?v6 rdf:type ?v .
  ?v6 dbp:name ?v0 .
  ?v6 dbp:pages ?v1 .
  ?v6 dbp:isbn ?v2 .
  ?v6 dbp:author ?v3 .
  ?v6 ?p ?o .
  FILTER(isNumeric(?o) || datatype(?o) = xsd:dateTime)
}
GROUP BY ?p
ORDER BY DESC(?n)
Top-k DBPSB step 1b: results. Not all sortable properties resemble reality: • Pages • ISBN • NumberOfPages • Year • Volume • wikiPageID • releaseDate • … NOTE: it requires manual selection
Top-k DBPSB step 1c: to generate queries with 2 rankable variables

SELECT ?p ?p1 (COUNT(?p1) AS ?n)
WHERE {
  ?v6 rdf:type ?v .
  ?v6 dbp:name ?v0 .
  ?v6 dbp:pages ?v1 .
  ?v6 dbp:isbn ?v2 .
  ?v6 dbp:author ?v3 .
  ?v6 ?p ?o .
  ?o ?p1 ?o1 .
  FILTER(isNumeric(?o1) || datatype(?o1) = xsd:dateTime)
}
GROUP BY ?p ?p1
ORDER BY DESC(?n)

NOTE: in practice we loop through all properties of ?v6 whose object is an IRI, in decreasing order of frequency
Top-k DBPSB step 1d: results • author, wikiPageID • author, wikiPageRevisionID • … • author, dateOfBirth • … • publisher, wikiPageID • publisher, wikiPageRevisionID • … • publisher, founded • … NOTE: it requires manual selection
Top-k DBPSB step 2: compute max and min values

SELECT (MAX(?o) AS ?max) (MIN(?o) AS ?min)
WHERE {
  ?v6 rdf:type ?v .
  ?v6 dbp:name ?v0 .
  ?v6 dbp:pages ?v1 .
  ?v6 dbp:isbn ?v2 .
  ?v6 dbp:author ?v3 .
  ?v6 dbp:pages ?o .
  FILTER(isNumeric(?o) || datatype(?o) = xsd:dateTime)
}

NOTE: the FILTER clause should not be necessary, but DBpedia is very dirty …
Top-k DBPSB step 3: generate the scoring function • Choose the number of rankable variables (max three), e.g., books and authors • Choose the number of scoring variables per rankable variable (max three), e.g., releaseDate for books and dateOfBirth for authors • Look up the min and the max of each scoring variable to normalise it • Choose the weights; the sum of the weights must be 1 • Assemble the scoring function, e.g., 0.5*norm(?releaseDate) + 0.5*norm(?dateOfBirth)
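A short Python sketch of step 3 under the constraints above (weights summing to 1, normalization via the step-2 min/max). Two assumptions are made for illustration: the weights are drawn at random, and the norm() shorthand is expanded into plain SPARQL arithmetic, since norm() is not a built-in SPARQL function; dateTime values would additionally need an engine-supported cast such as YEAR(?x).

```python
import random

def random_weights(n):
    # One simple way to obtain n positive weights that sum to 1
    # (the slides only require that the sum be 1).
    raw = [random.random() for _ in range(n)]
    total = sum(raw)
    return [r / total for r in raw]

def norm_expr(var, lo, hi):
    # Expand norm(x) = (x - min_x) / (max_x - min_x) into SPARQL arithmetic.
    return f"(({var} - {lo}) / ({hi} - {lo}))"

# Scoring function over two scoring variables with toy numeric bounds.
bounds = [("?releaseYear", 1900, 2012), ("?birthYear", 1900, 1995)]
weights = random_weights(len(bounds))
score_expr = " + ".join(f"{w:.2f}*{norm_expr(v, lo, hi)}"
                        for w, (v, lo, hi) in zip(weights, bounds))
print(score_expr)  # e.g. 0.43*((?releaseYear - 1900) / (2012 - 1900)) + 0.57*(...)
```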
Top-k DBPSB step 4: generate the top-k query

SELECT ?v6 ?v3 (0.5*norm(?o1) + 0.5*norm(?o2) AS ?s)
WHERE {
  ?v6 rdf:type ?v .
  ?v6 dbp:name ?v0 .
  ?v6 dbp:pages ?v1 .
  ?v6 dbp:isbn ?v2 .
  ?v6 dbp:author ?v3 .
  ?v6 dbp:releaseDate ?o1 .
  ?v3 dbp:dateOfBirth ?o2 .
  FILTER(isNumeric(?o1) || datatype(?o1) = xsd:dateTime)
  FILTER(isNumeric(?o2) || datatype(?o2) = xsd:dateTime)
}
ORDER BY ?s
LIMIT 10
Preliminary Results 1/2 • We tested our hypotheses using Virtuoso Open-Source Edition version 6.1.6, Jena TDB version 2.10.1, and the 10% DBpedia dataset • In this setting, Top-k DBPSB generates queries adequate to test H.2 (++ scoring variables ++ execution time) and H.3 (+/- LIMIT = execution time), but only partially adequate to test H.1 (++ rankable variables ++ execution time)
Preliminary Results 2/2 • H.1 (++ rankable variables ++ execution time): confirmed in some cases; not confirmed when aggregating by query across engines; confirmed when aggregating by engine across queries • H.2 (++ scoring variables ++ execution time): confirmed for Jena TDB; confirmed in most cases for Virtuoso • H.3 (+/- LIMIT = execution time): confirmed for Jena TDB; confirmed in most cases for Virtuoso
Conclusions • Top-k DBPSB is a successful first attempt to automatically generate top-k SPARQL queries that resemble reality and hit SPARQL engines where it hurts • More investigation is required: better understand the relationship between the number of rankable variables and the execution time (e.g., cardinalities, selectivity and joins); include other known features of top-k queries that impact execution time (e.g., the correlation of the orders induced on the result set by the different scoring variables in the scoring function, or the distribution of the values matched by the scoring variables)
Thank you! Any questions? Shima Zahmatkesh¹, Emanuele Della Valle¹, Daniele Dell'Aglio¹, and Alessandro Bozzon². ¹Politecnico di Milano, ²TU Delft