SPARQL 201: Construct queries and data maintenance

Nicholas Rejack – nrejack@ufl.edu
VIVO Implementation Fest – Boulder, CO
Wednesday, May 16, 2012 – 3:30 – 4:15 PM

Queries for exploring unfamiliar data
• When you encounter an unfamiliar endpoint, how do you explore it?
• First: check which ontologies exist (Example 1).
• Second: find out about those ontologies (2).
• Third: look at all the classes (3), object properties (4), and datatype properties (5).
• Examine them further, and so on.
• How do we know which classes are populated, and with how many instances?
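One way to answer the last question is an aggregate query. This is a minimal sketch, assuming the endpoint supports SPARQL 1.1 aggregates and that classes are declared as owl:Class; the variable names are illustrative only.

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>

# Count the instances of every declared class, most populated first.
# Classes with no instances do not appear, because the second
# triple pattern must match at least once.
SELECT ?class (COUNT(?instance) AS ?instances)
WHERE
{
  ?class rdf:type owl:Class .
  ?instance rdf:type ?class .
}
GROUP BY ?class
ORDER BY DESC(?instances)
```

Swapping owl:Class for owl:Ontology, owl:ObjectProperty, or owl:DatatypeProperty in a plain SELECT gives the listing queries for the earlier steps.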

CONSTRUCT: creating data for removal and transformation
• CONSTRUCT syntax:
    CONSTRUCT { template graph pattern }
    WHERE { matching graph pattern }
• Output is RDF (RDF/XML, etc.).
• Uses: get all the uses of one predicate out of VIVO; transform all data in a predictable way.
• Examples (6), (7)
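As a sketch of the first use, the query below pulls every statement that uses one predicate. rdfs:label is chosen here only for illustration; any predicate could take its place.

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Return every rdfs:label statement in the store as RDF.
# The template repeats the WHERE pattern, so the statements
# come back unchanged.
CONSTRUCT
{
  ?s rdfs:label ?label .
}
WHERE
{
  ?s rdfs:label ?label .
}
```

Changing the predicate in the CONSTRUCT template while keeping the WHERE pattern the same turns this extract into a simple transform: every matched statement is re-emitted with the new predicate.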

Finding data that may not be there: OPTIONAL
• A graph pattern matches only when every term in it matches.
• What if some terms are missing? Use an OPTIONAL clause.
• Warning: multiple OPTIONALs can slow a query down.
• Syntax:
    SELECT *
    WHERE
    {
      graph pattern
      OPTIONAL { graph pattern }
    }
• Example (8)
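A minimal sketch of OPTIONAL, assuming people are typed as foaf:Person and that some of them lack a foaf:firstName value:

```sparql
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

# Every person with a label is returned. ?firstName is filled in
# where it exists and left unbound where it does not.
SELECT ?person ?label ?firstName
WHERE
{
  ?person rdf:type foaf:Person .
  ?person rdfs:label ?label .
  OPTIONAL { ?person foaf:firstName ?firstName . }
}
```

Without the OPTIONAL wrapper, people who have no first name recorded would drop out of the results entirely.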

Sorting: ORDER BY
• Order results with the ORDER BY keyword.
• You can even sort on a bound variable that does not appear in the results (i.e., one that is not in the SELECT clause).
• Syntax:
    SELECT *
    WHERE
    {
      graph pattern
    }
    ORDER BY bound variable
• Examples (9), (10)
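The sketch below sorts on a variable that is not projected. It assumes people are typed as foaf:Person and carry an rdfs:label.

```sparql
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

# Only ?person is returned, but rows are sorted by ?label,
# which is bound in the WHERE clause.
SELECT ?person
WHERE
{
  ?person rdf:type foaf:Person .
  ?person rdfs:label ?label .
}
ORDER BY ?label
```

ORDER BY DESC(?label) reverses the sort direction.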

Negation
• Two options: NOT EXISTS and !bound
• Syntax with NOT EXISTS:
    SELECT *
    WHERE
    {
      graph pattern
      FILTER NOT EXISTS { graph pattern }
    }
• Example (11)
• Syntax with !bound:
    SELECT *
    WHERE
    {
      graph pattern
      OPTIONAL { graph pattern with ?variable }
      FILTER (!bound(?variable))
    }
• Examples (12), (12b)
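A sketch of the NOT EXISTS form, assuming a SPARQL 1.1 endpoint and people typed as foaf:Person. The choice of rdfs:label as the missing property is illustrative.

```sparql
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

# People for whom no rdfs:label statement exists.
SELECT ?person
WHERE
{
  ?person rdf:type foaf:Person .
  FILTER NOT EXISTS { ?person rdfs:label ?label . }
}
```

The OPTIONAL plus !bound form returns the same rows here and also works on endpoints that only support SPARQL 1.0.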

Data cleanup
• SPARQL is one of the best tools for data cleanup.
• Use cases:
• Grab a batch of related statements and delete them en masse.
• Generate missing data and upload it.
• Find missing property statements, such as people with no link to a position. Use !bound.
• Find values that have fewer than the required number of digits.
• Find similar names. Match on last name and first initial, first two letters, etc.
• Examine everything in a certain class that isn't in another specified class.
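The "people with no link to a position" case could look like the sketch below. The property core:personInPosition and the core: namespace are assumptions about the VIVO ontology version in use; substitute whatever property links people to positions in your installation.

```sparql
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
# Assumed namespace for the VIVO core ontology.
PREFIX core: <http://vivoweb.org/ontology/core#>

# People with no position: try to bind ?position, then keep
# only the rows where that attempt failed.
SELECT ?person
WHERE
{
  ?person rdf:type foaf:Person .
  OPTIONAL { ?person core:personInPosition ?position . }
  FILTER (!bound(?position))
}
```

Turning the SELECT into a CONSTRUCT over the same WHERE clause would produce the matching statements as RDF for batch review.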

Regex matching
• A couple of versions.
• Matching against a pattern:
    SELECT *
    WHERE
    {
      graph pattern
      FILTER regex(bound variable, "condition")
    }
• Example (14)
• Match eight characters ending with 1: Condition = "^.......1$" (seven dots, then 1; without the ^ and $ anchors the pattern also matches longer strings that contain such a run)
• Require a minimum length (here, at least seven characters): Condition = ".......+"
• Exact match: Condition = "^matching_string$" (note: regex is not required for this; you can match the literal directly, e.g. ?org ufVivo:harvestedBy "DSR-Harvester" .)
• Use negation where applicable.
• Match beginning of string: Condition = "^abcdef"
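Combining regex with negation gives a cleanup query for malformed identifiers. This sketch assumes the ufVivo namespace shown and assumes a well-formed UFID is exactly eight digits; adjust both to your data.

```sparql
# Assumed namespace for the UF local ontology.
PREFIX ufVivo: <http://vivo.ufl.edu/ontology/vivo-ufl/>

# UFIDs that are NOT exactly eight digits. str() lets the
# regex work on typed literals as well as plain ones.
SELECT ?x ?ufid
WHERE
{
  ?x ufVivo:ufid ?ufid .
  FILTER (!regex(str(?ufid), "^[0-9]{8}$"))
}
```

The ^ and $ anchors matter: without them the pattern would accept any value that merely contains eight consecutive digits.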

GROUP BY and HAVING
• When a variable has multiple bindings, you get a row back for each one.
• E.g., "show me everyone's label" returns one row per URI and label pair: if a URI has 4 labels, you get 4 rows.
• That is not useful for counting.
• Use GROUP BY to collapse rows on a particular bound variable.
• Use HAVING to filter the groups on a numeric (aggregate) expression.
• Example: find entities with more than one label (15)
• Side note: use > and < with numeric values to filter, e.g. FILTER (?value < 10)
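The "more than one label" example can be sketched as follows, assuming a SPARQL 1.1 endpoint. The variable names are illustrative.

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# One row per subject, with its label count, keeping only
# subjects that have more than one label.
SELECT ?s (COUNT(?label) AS ?labels)
WHERE
{
  ?s rdfs:label ?label .
}
GROUP BY ?s
HAVING (COUNT(?label) > 1)
```

HAVING is applied after grouping, so it can test the aggregate; a FILTER inside the WHERE clause cannot, because it sees individual rows.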

The black art of query optimization
• SPARQL can be very slow.
• How to speed up results:
• 1) Minimize the number of OPTIONALs you use.
• 2) "Pre-load" your queries by reducing your result set earlier. Instead of:
      ?x rdf:type foaf:Person .
      ?x ufVivo:ufid ?ufid .
    reduce the result set with:
      ?x ufVivo:ufid ?ufid .
    (we assume only people have UFIDs)
• 3) Change scope by wrapping lines in { }. Experiment! (Thanks to Alex)
• 4) Write your queries iteratively: comment out lines (#) and slowly increase the complexity.
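Tips 2 and 4 can be combined in one working draft. This is a sketch only: the ufVivo namespace is an assumption, the LIMIT is a convenience added here for quick test runs, and whether any of this helps depends on the query engine.

```sparql
PREFIX rdfs:   <http://www.w3.org/2000/01/rdf-schema#>
# Assumed namespace for the UF local ontology.
PREFIX ufVivo: <http://vivo.ufl.edu/ontology/vivo-ufl/>

SELECT ?x ?ufid
WHERE
{
  # Start with the pattern expected to match the fewest
  # resources (we assume only people have UFIDs).
  ?x ufVivo:ufid ?ufid .

  # Re-enable one line at a time and compare run times.
  # OPTIONAL { ?x rdfs:label ?label . }
}
LIMIT 100
```

Once each added line returns in acceptable time, remove the LIMIT and add any newly bound variables to the SELECT clause.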
