Linked Data Best Practices (and Abuses): Lessons Learned in IBM Rational. Arthur Ryman, 2014-04-15. Best Practices • Publishing vocabularies • Data model customization • Real-world things • JSON and RDF • Multi-valued and optional properties • Provenance and inverse properties
Linked Data Best Practices (and Abuses): Lessons Learned in IBM Rational. Arthur Ryman, 2014-04-15
Best Practices • Publishing vocabularies • Data model customization • Real-world things • JSON and RDF • Multi-valued and optional properties • Provenance and inverse properties • Ontologies and constraints
Publishing vocabularies • We should use established vocabularies if they exist • W3C, Dublin Core, OSLC, … • Any new terms we define should be described in vocabulary documents rooted at http://jazz.net/ns • propose generally useful terms to OSLC • When you look up an RDF term, you should get its vocabulary document • HTML for web browsers • RDF for programs, e.g. query builders • e.g. http://jazz.net/ns/qm/rqm#Category
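The HTML-for-browsers, RDF-for-programs rule above amounts to content negotiation on the term URI. A minimal sketch in Python, assuming a hypothetical vocabulary server that can produce HTML, Turtle, and RDF/XML; the function name and the fallback to HTML are illustrative assumptions, not how jazz.net is known to work:

```python
def choose_vocabulary_format(accept_header: str) -> str:
    """Pick the media type to serve when an RDF term URI is looked up."""
    # Formats this hypothetical vocabulary server can produce.
    supported = ("text/html", "text/turtle", "application/rdf+xml")
    best_type, best_q = "text/html", 0.0  # assumed fallback: HTML for browsers
    for item in accept_header.split(","):
        parts = [part.strip() for part in item.split(";")]
        media_type, q = parts[0], 1.0
        for param in parts[1:]:
            if param.startswith("q="):
                q = float(param[2:])
        # Keep the supported type with the highest quality value; ties go to
        # the type listed first in the header.
        if media_type in supported and q > best_q:
            best_type, best_q = media_type, q
    return best_type
```

A query builder sending `Accept: text/turtle` would get Turtle, while a browser sending `text/html,application/xhtml+xml;q=0.9` would get the HTML page.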
How to publish a vocabulary • We have a new public wiki! • https://jazz.net/wiki/bin/view/LinkedData • Read the guidelines • Create a wiki page and attach the HTML, Turtle, and RDF/XML files • Request a review from Nelson • Allow dev time to address issues • Arthur will redirect jazz.net/ns to the wiki
Abuses • You published your vocabulary but skimped on the content • e.g. minimal or cryptic comments • You published your vocabulary, but didn’t keep it up-to-date • e.g. Focal Point 227292 • You created some new terms but didn’t publish your vocabulary • e.g. JLIP Tracked Resource Set 306919
Data model customization • Many of our tools allow customization • e.g. RTC work items • We need to expose the custom data elements as RDF • Tools should allow users to map custom data elements to externally defined RDF terms • industry standards • corporate standards • When no mapping is specified, tools should generate local RDF terms and vocabularies • vocabularies are needed by query authors • tools must host the vocabularies they generate
Abuses • Your tool generates a cryptic URI for local RDF terms • Obfuscates meaning • Forces humans to access vocabulary document • Your tool does not generate a vocabulary document for local RDF terms • e.g. RTC 304143 • see following case study • When the mapping to RDF is changed, your tool does not create TRS change events for just the affected resources
Case study: RTC Work Items • Some attributes are built-in • Some are defined by OSLC CM 2.0 • Some are user defined • Consider Priority
RDF triple for priority • Subject (good) <https://jazzop05.rtp.raleigh.ibm.com:9943/jazz/resource/itemName/com.ibm.team.workitem.WorkItem/224727> • Predicate (bad) <http://open-services.net/ns/cm-x#priority> • Object (ugly) <https://jazzop05.rtp.raleigh.ibm.com:9943/jazz/oslc/enumerations/_QYx2UBIzEd6bpunPP4ZLOA/priority/priority.literal.l3>
Problems • The priority predicate comes from a non-existent vocabulary (bad) • http://open-services.net/ns/cm-x# • RDF vocabularies should be dereferenceable • OSLC should publish it, tagged as archaic • The object is a dereferenceable URI (good), but not a vocabulary term (ugly) • Need rdfs:label, rdfs:comment for query authors • Result: no easy way to write queries based on priority
Best Practice for external vocabularies • RTC project template should refer to external vocabularies for standard terms • OSLC CM V3 defines priority and 4 values • Teach and enable clients to create corporate standard vocabularies for reuse of common terms (UA) • Needed for cross-project queries • Provide export/import UI to manage vocabularies • E.g. Focal Point uses simple spreadsheet format
Best Practice for local vocabularies • RTC (and all other tools) should generate a local RDF vocabulary for all user-defined terms • Include rdfs:label, rdfs:comment for query authors (and other consumers) • LQE admin should load user-defined vocabularies into LQE to make them available to queries • provide programmatic integration, e.g. a special purpose vocabulary TRS
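Generating a local vocabulary can be sketched as follows. This is an illustration only: the base URI, the attribute names, and the camelCase naming rule are assumptions, and a real tool would use its own term-naming policy. The point is that each user-defined term gets a readable URI plus rdfs:label and rdfs:comment.

```python
import re


def local_term_uri(base: str, name: str) -> str:
    """Build a readable (non-cryptic) term URI from a custom attribute name."""
    words = re.findall(r"[A-Za-z0-9]+", name)
    if not words:
        raise ValueError("attribute name has no usable characters")
    local = words[0].lower() + "".join(w.capitalize() for w in words[1:])
    return f"{base}#{local}"


def turtle_literal(text: str) -> str:
    """Quote a string as a Turtle literal, escaping the characters that need it."""
    escaped = (text.replace("\\", "\\\\").replace('"', '\\"')
                   .replace("\n", "\\n").replace("\r", "\\r"))
    return f'"{escaped}"'


def vocabulary_turtle(base: str, attributes) -> str:
    """Render (name, comment) pairs as a Turtle vocabulary document."""
    lines = [
        "@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .",
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
        "",
    ]
    for name, comment in attributes:
        lines.append(f"<{local_term_uri(base, name)}> a rdf:Property ;")
        lines.append(f"    rdfs:label {turtle_literal(name)} ;")
        lines.append(f"    rdfs:comment {turtle_literal(comment)} .")
    return "\n".join(lines)


# Hypothetical custom attribute defined by a project administrator.
doc = vocabulary_turtle(
    "https://example.com/vocab/project1",
    [("Customer Priority", "How urgent the customer considers this work item.")],
)
```

The tool would host `doc` at the base URI so that query authors can look up each term.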
Best Practice for all vocabularies • When an administrator changes the RDF representation of a set of resources, corresponding change events MUST be added to the TRS change log • Add/remove custom attributes and values • Modify mapping to RDF URIs • Allow the administrator to make multiple representation changes and then manually trigger the generation of change events • Batch multiple representation changes together to minimize re-indexing time and server load
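The batching idea can be sketched as below. The class and method names are hypothetical; the dictionary keys only mirror the TRS terms trs:Modification, trs:changed, and trs:order, and a real implementation would write RDF change events into the change log instead of returning dictionaries.

```python
class RepresentationChangeBatch:
    """Collect affected resources across several admin changes, then emit
    one change event per resource when the administrator triggers it."""

    def __init__(self, start_order: int = 1):
        self._affected = set()
        self._next_order = start_order

    def record(self, resource_uris):
        """Note the resources whose RDF representation a change touched."""
        self._affected.update(resource_uris)

    def flush(self):
        """Emit one Modification event per affected resource, then reset."""
        events = []
        for uri in sorted(self._affected):
            events.append({"type": "trs:Modification",
                           "changed": uri,
                           "order": self._next_order})
            self._next_order += 1
        self._affected.clear()
        return events
```

A resource touched by two successive mapping changes is re-indexed once, not twice, because the set removes duplicates before any event is generated.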
La Trahison des Images • "The famous pipe. How people reproached me for it! And yet, could you stuff my pipe? No, it's just a representation, is it not? So if I had written on my picture 'This is a pipe', I'd have been lying!" - René Magritte
Real-world things • Linked Data differentiates between two kinds of thing • Information, e.g. a document on the web • Real-world, e.g. a person • Both kinds should be identified with HTTP URIs • Looking up a real-world URI should result in an information resource that contains information about the real-world thing • URI-references (hash URIs) • HTTP redirect: 303 See Other (303 URIs) • Refer to Cool URIs for the Semantic Web
Example foaf:Person • Suppose you create a document, http://people.org/johnsmith, about John Smith on 2013-09-17 • The following is nonsense because John Smith was not created on 2013-09-17:
<http://people.org/johnsmith> a foaf:Person .
<http://people.org/johnsmith> dcterms:created "2013-09-17"^^xsd:date .
• The following makes sense because the hash URI identifies the person, while the plain URI identifies the document:
<http://people.org/johnsmith#me> a foaf:Person .
<http://people.org/johnsmith> dcterms:created "2013-09-17"^^xsd:date .
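Hash URIs work because an HTTP client strips the fragment before sending the request, so looking up the person URI retrieves the document about the person. A small sketch using Python's standard library; the function name is illustrative:

```python
from urllib.parse import urldefrag


def document_uri(thing_uri: str) -> str:
    """Return the URI an HTTP client actually requests for a (hash) URI."""
    document, _fragment = urldefrag(thing_uri)
    return document


# The person and the document have different URIs, but one lookup serves both.
person = "http://people.org/johnsmith#me"
retrieved = document_uri(person)
```

Here `retrieved` is the document URI from the example above, which is the right subject for dcterms:created.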
Abuses • Failure to differentiate between a person and an account owned by a person • Leads to nonsense triples • Focal Point Defect 234212 • JTS Defect 307861 • See following JTS users case study • NOTE: email address is the preferred way to identify people across tools
JTS Users • OSLC Core specifies that the object of dcterms:creator, dcterms:contributor, oslc:modifiedBy should be a resource of class foaf:Agent or foaf:Person (real-world) • RTC implements OSLC CM and has triples like:
<https://jazz.net/jazz02/resource/...WorkItem/72226>
    dcterms:creator <https://jazz.net/jts04/users/ryman> ;
    dcterms:contributor <https://jazz.net/jts04/users/retchles> .
Best Practice • The property j.1:archived applies to the user account (information resource), not the person (real-world) • Solution 1: use hash URIs for people:<https://jazz.net/jts04/users/ryman#me> • Solution 2: use 303 URIs for accounts (preferred by Philippe):<https://jazz.net/jts04/accounts/ryman>
303 URI Solution
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix jfs: <http://jazz.net/xmlns/prod/jazz/jfs/1.0/> .

<https://jazz.net/jts04/accounts/ryman>
    a foaf:OnlineAccount ;
    jfs:archived false .

<https://jazz.net/jts04/users/ryman>
    a foaf:Person ;
    foaf:account <https://jazz.net/jts04/accounts/ryman> ;
    foaf:img <https://jazz.net/jts04/users/photo/ryman> ;
    foaf:mbox <mailto:ryman@ca.ibm.com> ;
    foaf:name "Arthur Ryman" ;
    foaf:nick "ryman" .
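The 303 pattern itself can be sketched as a lookup that redirects real-world URIs to a document about them. Everything named here is an assumption for illustration: the function, the path tables, and especially the redirect target, since the slides do not say which document a person URI should redirect to.

```python
def respond(path, person_documents, stored_documents):
    """Return (status, headers) for a GET on a hypothetical server.

    person_documents maps real-world (303) paths to the path of a document
    describing them; stored_documents is the set of information resources.
    """
    if path in person_documents:
        # A person cannot be sent over HTTP, so point the client at a
        # document about the person instead.
        return 303, {"Location": person_documents[path]}
    if path in stored_documents:
        return 200, {"Content-Type": "text/turtle"}
    return 404, {}


# Hypothetical routing tables; the "/about" document path is invented here.
people = {"/jts04/users/ryman": "/jts04/users/ryman/about"}
documents = {"/jts04/users/ryman/about", "/jts04/accounts/ryman"}
```

With these tables, a GET on the person path yields a 303 with a Location header, while the account path is served directly as an information resource.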
JSON • Familiar to OO and Web developers • Popularity fueled by Cloud • e.g. Amazon uses JSON as the payload in AWS REST APIs as an alternative to SOAP and XML • Simpler/faster to handle by web clients • Use is spreading across the stack • MongoDB, CouchDB/Cloudant • node.js
JSON and RDF • Some developers are saying: “JSON is simpler and more popular than RDF. Let’s use JSON instead of RDF.” • This is a false dichotomy • JSON is just as problematic as XML for data integration • JSON and XML are message formats • Linked Data is our integration strategy • RDF expresses semantics • Use JSON-LD, now a W3C standard • OSLC and Rational should publish standard contexts • See following LQE Security Context case study
Initial JSON design • Simple, but no explicit semantics • Use of UUIDs instead of HTTP URIs
[
  {
    "security_context_id": "urn:uuid:f81d4fae-7dec-11d0-a765-00a0c91e6bf6",
    "name": "Resources for Alpha project"
  },
  {
    "security_context_id": "urn:uuid:g92e5gbf-8efd-22e1-b876-11b1d02f7cg7",
    "name": "Resources for Beta project"
  }
]
Equivalent JSON-LD design
{
  "@context": {
    "@base": "https://example.com/sc",
    "dcterms": "http://purl.org/dc/terms/"
  },
  "@graph": [
    { "@id": "#1", "dcterms:title": "Resources for Alpha project" },
    { "@id": "#2", "dcterms:title": "Resources for Beta project" }
  ]
}
Final JSON-LD design with type info
{
  "@graph": [
    {
      "@id": "https://example.com/sc",
      "@type": "http://open-services.net/ns/core/sc#SecurityContextList"
    },
    {
      "@id": "https://example.com/sc#1",
      "@type": "http://open-services.net/ns/core/sc#SecurityContext",
      "http://purl.org/dc/terms/title": "Resources for Alpha project"
    },
    {
      "@id": "https://example.com/sc#2",
      "@type": "http://open-services.net/ns/core/sc#SecurityContext",
      "http://purl.org/dc/terms/title": "Resources for Beta project"
    }
  ]
}
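To see why the compact design with @base and a dcterms prefix carries the same identifiers as the fully spelled-out one, the context can be applied by hand. This is a toy sketch, not the JSON-LD expansion algorithm: it handles only @base, simple prefixes, and @id, and its output is not the official expanded form. A real application would use a JSON-LD processor.

```python
from urllib.parse import urljoin


def apply_context(doc):
    """Resolve relative @id values and prefixed keys using the @context."""
    context = doc["@context"]
    base = context["@base"]
    prefixes = {k: v for k, v in context.items() if not k.startswith("@")}

    def expand_term(term):
        prefix, sep, local = term.partition(":")
        if sep and prefix in prefixes:
            return prefixes[prefix] + local
        return term

    nodes = []
    for node in doc["@graph"]:
        out = {}
        for key, value in node.items():
            if key == "@id":
                out["@id"] = urljoin(base, value)  # "#1" -> base + "#1"
            else:
                out[expand_term(key)] = value
        nodes.append(out)
    return nodes


compact = {
    "@context": {"@base": "https://example.com/sc",
                 "dcterms": "http://purl.org/dc/terms/"},
    "@graph": [
        {"@id": "#1", "dcterms:title": "Resources for Alpha project"},
        {"@id": "#2", "dcterms:title": "Resources for Beta project"},
    ],
}
nodes = apply_context(compact)
```

Each resulting node has an HTTP URI as its identifier and a full Dublin Core property URI as its key, which is what gives the JSON explicit semantics.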
Multi-valued and optional properties • RDF documents contain sets of triples • Model multi-valued properties by a set of triples that share a common subject and predicate • Model the absence of an optional property by an empty set of triples
Abuses • Model multiple values of a property by concatenating the values into a single object • Defeats database indexing • Slows queries since substring matching must be used • Model the absence of an optional value using the presence of an empty string • Adds many unnecessary triples • Slows queries (longer scans) • Sometimes an empty string is a meaningful value • Sometimes an empty string is lexically invalid • See following RTC tag case study (Defect 271867)
RDF representation
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix rtc_cm: <http://jazz.net/xmlns/prod/jazz/rtc/cm/1.0/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@base <https://jazz.net/jazz/resource/itemName/com.ibm.team.workitem.WorkItem/> .

<271867>
    dcterms:subject "datagap, oslc, next_release_candidate, data_gap, reporting-gap"^^xsd:string ;
    …
    rtc_cm:estimate ""^^xsd:long .

Syntax validated OK. There were warnings: Typed literal has an invalid lexical value: Input string was not in the correct format: s.Length==0.: ""^^<http://www.w3.org/2001/XMLSchema#long>.
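The corrected modeling for this case can be sketched as two small helpers: one triple per tag, and no triple at all for a missing estimate. The function names and the tuple representation of triples are illustrative assumptions, not RTC code.

```python
def tag_triples(subject, predicate, raw_tags):
    """Emit one triple per tag instead of one comma-joined literal."""
    tags = sorted({tag.strip() for tag in raw_tags.split(",") if tag.strip()})
    return [(subject, predicate, tag) for tag in tags]


def optional_long_triples(subject, predicate, raw_value):
    """Emit no triple when a numeric field is empty, since "" is not a
    valid xsd:long lexical value."""
    if raw_value.strip() == "":
        return []
    return [(subject, predicate, int(raw_value))]


triples = (
    tag_triples("<271867>", "dcterms:subject",
                "datagap, oslc, next_release_candidate, data_gap, reporting-gap")
    + optional_long_triples("<271867>", "rtc_cm:estimate", "")
)
```

A query for work items tagged `oslc` can then match the object exactly instead of using substring matching.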
Provenance: Where did the triple come from? • A statement is represented by a triple • Triples from multiple documents may be merged and queried • Default graph is a triple store • When storing RDF documents, the document URL is often used as the name of a graph (e.g. in LQE) • triple + graph name = quad • triple stores are really quad stores • Provenance of triples is important in several use cases • Updating a document • Access control • VVC (which version)
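The triple + graph name = quad idea, and why it matters when a document is updated, can be sketched with a toy in-memory store. The class is hypothetical and is not how LQE is implemented.

```python
class QuadStore:
    """Toy quad store: each triple is kept with the URL of its source document."""

    def __init__(self):
        self._quads = set()

    def replace_document(self, graph, triples):
        """Update a document: drop its old triples, then add the new ones."""
        self._quads = {q for q in self._quads if q[3] != graph}
        self._quads |= {(s, p, o, graph) for (s, p, o) in triples}

    def sources(self, s, p, o):
        """Return the documents that state the given triple (its provenance)."""
        return sorted(g for (s2, p2, o2, g) in self._quads
                      if (s2, p2, o2) == (s, p, o))


store = QuadStore()
link = ("Testcase1", "validates", "Requirement2")
store.replace_document("https://example.com/docA", [link])
store.replace_document("https://example.com/docC", [link])
before = store.sources(*link)
store.replace_document("https://example.com/docA", [])  # docA no longer states it
after = store.sources(*link)
```

Without the graph name, removing the triple on behalf of docA would also remove the statement that docC still makes.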
Provenance and authority • The authority (trust) of a triple depends on the author of the document that contains the triple • Triples should be placed in the document that the author is authorized to modify • When creating a link from A to B, put the link in the document that the author is editing, not necessarily A or B or both • Document C may contain links from A to B
Inverse properties • Directed relations between resources (links) may be stated in two equivalent ways, e.g. • Testcase1 validates Requirement2 . • Requirement2 isValidatedBy Testcase1 . • There is no benefit to having mutual inverse pairs of properties • The existence of mutual inverse pairs of properties makes query authoring more complex, and query execution more expensive • A triple should be put in the document that the author of the triple is editing (provenance) • There is no special significance attached to being the subject of a triple • See OSLC guidance on preferred direction of properties • Direction should be from downstream to upstream, e.g. test case validates requirement
Abuses • OSLC domain specs define many pairs of mutual inverse predicates • Recommendation • Deprecate one member of each pair • Replace deprecated property in all RDF representations and queries
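Replacing a deprecated inverse property can be sketched as a rewrite that swaps subject and object. The property names come from the validates / isValidatedBy example; the function and the mapping table are illustrative assumptions.

```python
def normalize_direction(triples, inverse_of):
    """Rewrite triples that use a deprecated inverse property so that only
    the preferred direction remains."""
    normalized = set()
    for s, p, o in triples:
        if p in inverse_of:
            normalized.add((o, inverse_of[p], s))  # swap subject and object
        else:
            normalized.add((s, p, o))
    return normalized


# Deprecated property -> preferred property (downstream to upstream).
INVERSE_OF = {"isValidatedBy": "validates"}

data = {
    ("Requirement2", "isValidatedBy", "Testcase1"),
    ("Testcase1", "validates", "Requirement2"),
}
result = normalize_direction(data, INVERSE_OF)
```

Both statements collapse into a single triple, so a query only has to match one predicate in one direction.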