SPARQLeR: Extended Sparql for Semantic Association Discovery
Krzysztof Kochut and Maciej Janik
ESWC 2007, Innsbruck, Austria, June 4, 2007
Work supported by the National Science Foundation Grant No. IIS-0325464, entitled “SemDIS: Discovering Complex Relationships in the Semantic Web”.
Paths in RDF
[Figure: example RDF graph with child, older, and works_for properties, showing three kinds of path: a directed path; an undirected path, but with specific properties and directionality; and an undirected path.]
Why are paths interesting?
• A path describes how entities are related.
• The relationships on the path define the meaning of the connection.
• The entities on the path specify its content.
• Do you have migraine? Try taking magnesium!
  • Path discovered by Dr. D.R. Swanson from partial information available in PubMed publications:
  • stress can lead to loss of magnesium in the human body
  • migraine patients seem to be experiencing stress
  • … that's why … migraine could lead to a loss of magnesium, so … take magnesium to fight migraine!
Swanson, D.R. Migraine and Magnesium: Eleven Neglected Connections. Perspectives in Biology and Medicine, 31 (4), 1988. 526-557.
Formally, what is a simple path?
• Simple directed path between resources r_0 and r_n in a description base R:
  • a sequence r_0 p_1 r_1 p_2 r_2 … p_{n-1} r_{n-1} p_n r_n (n > 0)
  • r_0 p_1 r_1, r_1 p_2 r_2, …, r_{n-2} p_{n-1} r_{n-1}, r_{n-1} p_n r_n are triples in R
  • all of the resources r_i (0 ≤ i ≤ n) in the path are distinct
• Simple undirected path between resources r_0 and r_n in R:
  • a sequence r_0 p_1 r_1 p_2 r_2 … p_{n-1} r_{n-1} p_n r_n (n > 0)
  • for each r_{i-1} p_i r_i (0 < i ≤ n) in the path, either r_{i-1} p_i r_i or r_i p_i r_{i-1} is a triple in R
  • all of the resources r_i (0 ≤ i ≤ n) in the path are distinct
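The two definitions above can be checked mechanically. The following is a minimal Python sketch, not part of SPARQLeR itself: the function name `is_simple_path`, the tuple representation of triples, and the sample description base `R` are all assumptions made for illustration.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, property, object)


def is_simple_path(seq: List[str], triples: Set[Triple], directed: bool = True) -> bool:
    """Check a sequence [r0, p1, r1, ..., pn, rn] against the slide's definition."""
    # n > 0 means at least one property, so at least 3 elements, and always an odd count.
    if len(seq) < 3 or len(seq) % 2 == 0:
        return False
    resources = seq[0::2]
    properties = seq[1::2]
    # All resources on a simple path must be distinct.
    if len(set(resources)) != len(resources):
        return False
    for i, prop in enumerate(properties):
        left, right = resources[i], resources[i + 1]
        forward = (left, prop, right) in triples
        backward = (right, prop, left) in triples
        # Directed: the triple must exist as written. Undirected: either direction will do.
        if not (forward or (not directed and backward)):
            return False
    return True


# Hypothetical description base, loosely modelled on the "Paths in RDF" slide.
R = {("a", "child", "b"), ("c", "child", "b"), ("c", "works_for", "d")}

print(is_simple_path(["a", "child", "b"], R))
print(is_simple_path(["a", "child", "b", "child", "c"], R))
print(is_simple_path(["a", "child", "b", "child", "c"], R, directed=False))
```

The second call fails because (b, child, c) is not a triple in `R`; the third succeeds because the undirected definition also accepts (c, child, b).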
Paths and SPARQL
• A SPARQL query can express only static graph patterns.
• Some flexibility is introduced by the OPTIONAL part, but it does not solve path problems.
• There is no support for flexible-length path expressions.
  • The glycan biosynthesis pathway in biology has a specific pattern (properties), but its length may be unknown.
  • A path to be discovered may be of unknown length and pattern, as in Dr. Swanson's example.
What do we need to discover paths?
• Knowledge discovery needs more flexible patterns.
• Patterns may be partially known or even unknown (unrestricted path).
• Properties on the path, their order, and their directionality create a specific meaning.
• Entities on the path provide content.
• Relationships to entities outside of the path give additional context.
Proposed extensions
• A path may have a flexible length
  • For computational reasons, the length is limited.
• Constraints on properties
  • Specific properties must appear in the path.
  • Their order and directionality are meaningful.
  • They can form a repeating pattern.
• Constraints on resources
  • Specific resources must be on the path.
  • They can be anywhere on the path or at specific positions.
SPARQLeR
• Extension of SPARQL for semantic association discovery.
• Seamlessly integrated into the SPARQL syntax.
• Graph patterns incorporating simple paths with constraints.
• Constraints are based on regular expressions over properties.
What is a path in SPARQLeR?
• A path is a meta-property that connects two resources.
• It is defined as a sequence of interleaving properties and resources.
• It starts and ends with properties (endpoint resources are not included).
• A path of length 1 is a sequence with just one property.
<rdfs:Class rdf:about="http://meta.org/rdf-meta-schema#Path">
  <rdfs:isDefinedBy rdf:resource="http://meta.org/rdf-meta-schema#"/>
  <rdfs:subClassOf rdf:resource="http://www.w3.org/1999/02/22-rdf-syntax-ns#Property"/>
  <rdfs:subClassOf rdf:resource="http://www.w3.org/1999/02/22-rdf-syntax-ns#Seq"/>
  <rdfs:label>Path</rdfs:label>
  <rdfs:comment>The class of RDFMS paths.</rdfs:comment>
</rdfs:Class>
Path patterns in SPARQLeR
• Meta-property: a concept similar to a property
  • Resource –[property] Resource
  • Resource –[path] Resource
• Path as a sequence
  • Test if a resource is in the path: rdfs:member
  • Test if a resource is at a specific position in the path: rdf:_2, rdf:_4, ...
• SPARQLeR-specific path properties
  • Test all resources or all properties in the path: rdfms:entityResource and rdfms:propertyResource
  • Example: all resources on a path must be of type foo:Person
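To make the sequence view concrete, here is a small Python sketch of a path as a 1-indexed sequence. Because a path starts and ends with properties, odd positions hold properties and even positions hold resources, which is why the slide tests resources at rdf:_2, rdf:_4, and so on. The class `Path`, its method names, and the sample data are hypothetical; treating rdfs:member as a test over every element of the sequence is also an assumption.

```python
from typing import Dict, List


class Path:
    """A path stored as [p1, r1, p2, ..., pn]; the endpoint resources are not included."""

    def __init__(self, elements: List[str]):
        if len(elements) % 2 == 0:
            raise ValueError("a path starts and ends with a property")
        self.elements = elements

    def length(self) -> int:
        # Path length counts properties (hops), not sequence elements.
        return (len(self.elements) + 1) // 2

    def member(self, x: str) -> bool:
        # Analogue of "%path rdfs:member <x>".
        return x in self.elements

    def at(self, n: int) -> str:
        # Analogue of "%path rdf:_n <x>" (1-indexed, as in rdf:Seq).
        return self.elements[n - 1]

    def entity_resources(self) -> List[str]:
        # Analogue of rdfms:entityResource: every intermediate resource.
        return self.elements[1::2]

    def property_resources(self) -> List[str]:
        # Analogue of rdfms:propertyResource: every property.
        return self.elements[0::2]


path = Path(["p1", "alice", "p2", "bob", "p1", "carol", "p3"])
types: Dict[str, str] = {"alice": "foo:Person", "bob": "foo:Person", "carol": "foo:Person"}

print(path.length(), len(path.elements))
print(path.at(2), path.member("bob"))
# "All resources on the path must be of type foo:Person"
print(all(types[r] == "foo:Person" for r in path.entity_resources()))
```

The sample path has length 4 and 7 elements, the same shape as the path shown on the path pattern anatomy slide.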
Path patterns (match of path variable)
[Figure: path pattern anatomy. A matched path of length 4 has 7 elements, numbered 1 to 7. The figure labels the properties p1, p2, p3 and illustrates rdfs:member, rdf:_3, rdf:_6, rdfms:entityResource, and rdfms:propertyResource.]
Path types in SPARQLeR
• The directionality of relationships in the path defines its specific semantics.
• SPARQLeR allows the definition of the following path types:
  • As defined in graph theory
    • Directed
    • Undirected
  • SPARQLeR-specific extension
    • Defined directionality path (includes directed path)
Directionality of properties in path
• Defined directionality paths:
  • Neither directed nor undirected
  • Each property in a path has a specified directionality
• Example: simple graph with p relationship
  (a) X p* Y, directed path
  (b) X p* Y, undirected path
  (c) X ( p p^-1 )* Y, directional path
[Figure: three copies of a graph between X and Y with edges labeled p, one for each of the cases (a), (b), and (c).]
Inverse property operator
• In standard SPARQL there is no need for an inverse property operator
  • Pattern syntax is based on individual statements, so it is easy to reverse direction.
• Defining path constraints requires the inverse operator
  • A path expression defines constraints on properties, not on individual statements.
  • Without the inverse property operator, some path constraints would be impossible to express (as shown in the previous example).
RegExp in path constraints
• Path constraints on properties are based on regular expressions
  • Uses syntax similar to lex
  • Easy for grep users
• Examples:
  a c* d
  a+ (b|c) a
  [abc] c? d
  ( b a^-1 )+ c
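Since the syntax is close to what grep users know, the examples above can be tried with an ordinary regular expression engine once every property is written as a single character. The sketch below is only an analogy, not the SPARQLeR matcher: it assumes one lowercase letter per property and, as a hypothetical encoding, the uppercase letter for its inverse (so a^-1 becomes A).

```python
import re


def matches(path_expr: str, path: str) -> bool:
    """Match a whole property sequence against a path expression.

    Spaces in the expression only separate symbols, so they are dropped.
    """
    pattern = path_expr.replace(" ", "")
    return re.fullmatch(pattern, path) is not None


print(matches("a c* d", "accd"))        # a, then any number of c, then d
print(matches("a+ (b|c) a", "aaca"))    # one or more a, then b or c, then a
print(matches("[abc] c? d", "bd"))      # class of properties, optional c, then d
print(matches("(b A)+ c", "bAbAc"))     # b followed by the inverse of a, repeated
print(matches("a c* d", "abd"))         # b is not allowed between a and d
```

The whole path must match (`re.fullmatch`), which mirrors the idea that a path constraint describes the complete sequence of properties between the two endpoints.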
Path constraints in SPARQLeR
• Defined as regular path expressions
• Can specify patterns of properties in the path
• A directionality requirement needs the inverse operator ('-', minus), as in -p
• Supported regular expressions:
  p (single property)
  -p (the inverse of p)
  [p1 p2 ... pn] (class of properties)
  -[p1 p2 ... pn] (class of inverse properties)
  [^p1 p2 ... pn] (complement of properties)
  -[^p1 p2 ... pn] (inverse of complement of properties)
  . (wildcard)
  x | y (alternative)
  xy (concatenation)
  x* (Kleene star)
  x+ (one or more repetitions)
  (x) (match a path matched by x)
Path constraints (cont’d)
• Class of properties and inverse operator
  • The complement operator can be applied only to defined properties, not their inverses
• Inverse operator
  • Not allowed inside a class of properties
  • The set of inverses is created from the defined properties
• Example, with properties q r s t:
  [^rt] matches q s
  -[^qr] matches t^-1 s^-1 (inverses)
  ([^st] | -[^t]) matches q r q^-1 r^-1 s^-1
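The example above can be reproduced with plain set operations. In this Python sketch a step is a pair (property name, is_inverse); the helper names `complement` and `inverse` and the pair encoding are assumptions made for illustration, not SPARQLeR internals.

```python
from typing import Iterable, Set, Tuple

Step = Tuple[str, bool]  # (property name, True if traversed as an inverse)

PROPS = {"q", "r", "s", "t"}  # the defined properties from the slide


def complement(names: Iterable[str]) -> Set[Step]:
    # [^...]: every defined property except the listed ones, never an inverse.
    return {(n, False) for n in PROPS - set(names)}


def inverse(steps: Set[Step]) -> Set[Step]:
    # -[...]: the same properties, each traversed in the inverse direction.
    return {(n, not inv) for (n, inv) in steps}


def show(steps: Set[Step]) -> str:
    return " ".join(n + ("^-1" if inv else "") for n, inv in sorted(steps))


print(show(complement("rt")))                                  # [^rt]
print(show(inverse(complement("qr"))))                         # -[^qr]
print(show(complement("st") | inverse(complement("t"))))       # ([^st] | -[^t])
```

Applying the complement first and the inverse second is what keeps the complement restricted to defined properties, as the slide requires.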
Integrating paths into SPARQL
• A path variable binds a path
  • Its name begins with ‘%’ instead of ‘?’
• Simple patterns: path between two resources
  SELECT ?prop WHERE {<r> ?prop <s>}
  SELECT %path WHERE {<r> %path <s>}
• Single source path
  SELECT %path, ?res WHERE {<r> %path ?res}
Integrating paths into SPARQL
• Resources on the path
  SELECT %path WHERE {<r> %path <s> . %path rdfs:member <e>}
  SELECT %path WHERE {<r> %path <s> . %path rdf:_1 <p>}
• Listing path elements: the list operator
  SELECT list(%path) WHERE {<r> %path <s>}
Expressing path constraints
• Bounded path length
  • only constants are allowed
  FILTER(length(%path)<5)
  FILTER(length(%path)>3 && length(%path)<7)
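A length bound is what keeps path enumeration finite in practice. The Python sketch below enumerates simple paths between two resources up to a maximum number of properties using a plain depth-first search. It is a swapped-in illustration under stated assumptions, not the prototype's search strategy; the function name `simple_paths`, the '-p' spelling for an inverse traversal, and the sample triples are hypothetical.

```python
from typing import Iterator, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, property, object)


def simple_paths(triples: Set[Triple], start: str, end: str,
                 max_len: int, undirected: bool = False) -> Iterator[List[str]]:
    """Yield simple paths as [p1, r1, p2, ..., pn] with at most max_len properties."""

    def neighbours(node: str):
        for s, p, o in sorted(triples):
            if s == node:
                yield p, o            # follow the triple as written
            if undirected and o == node:
                yield "-" + p, s      # follow the triple against its direction

    def dfs(node: str, visited: Set[str], path: List[str]):
        if node == end and path:
            yield path[:-1]           # drop the end resource: endpoints are not in the path
            return
        if len(path) // 2 >= max_len:
            return                    # length bound reached, stop extending
        for prop, nxt in neighbours(node):
            if nxt not in visited:    # simple path: no resource is visited twice
                yield from dfs(nxt, visited | {nxt}, path + [prop, nxt])

    yield from dfs(start, {start}, [])


R = {("r", "p", "a"), ("a", "p", "s"), ("s", "q", "r")}

# Roughly FILTER(length(%path)<5): at most 4 properties.
print(list(simple_paths(R, "r", "s", max_len=4)))
print(list(simple_paths(R, "r", "s", max_len=4, undirected=True)))
print(list(simple_paths(R, "r", "s", max_len=1, undirected=True)))
```

With the bound lowered to 1, only the single inverse hop over q survives, which shows how a tighter FILTER prunes the search.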
Expressing path constraints
• Constraints are added as a regular expression filter (existing syntax in SPARQL)
  regex( pathvariable, pathexpr, pathflags )
  FILTER(regex(%path, ".*foo:prop.*", "uis"))
• Flags: i (instances), s (schema), l (literals), h (match using hierarchy), d (set directionality), u (undirected)
• Default flags: d i
Some examples
SELECT list(%path), ?res
WHERE {<r> %path ?res .
       %path rdfs:member ?x .
       ?x foo:locatedIn wiki:Europe
       FILTER(regex(%path, "foo:prop+"))}

SELECT list(%path)
WHERE {<r> %path <s> .
       %path rdfms:entityResource ?x .
       ?x rdf:type foo:Person
       FILTER(regex(%path, "(foo:prop|foo:rel)+", "u"))}

SELECT list(%path)
WHERE {<r> %path <s>
       FILTER(length(%path)<=6 && length(%path)>=4 &&
              regex(%path, "(foo:prop -foo:rel)+"))}
SPARQLeR Prototype Implementation
• The prototype implementation is based on BRAHMS, an RDF/S main memory storage.
• Path search is based on a bi-directional BFS for simple paths.
• Checking of path constraints in regex is implemented as a simulation of DFAs.
Janik, M. and Kochut, K., BRAHMS: A WorkBench RDF Store And High Performance Memory System for Semantic Association Discovery. ISWC 2005
Implementation details
• Each path expression (FILTER regex) is translated into a DFA.
• For a path between two resources, partial constraints are checked while building the search trie from both endpoints, using forward and reverse DFAs.
• When a path is connected, the forward DFA is used to check the full path constraint.
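The idea of simulating a DFA over the properties of a path can be sketched in a few lines of Python. The automaton below is written by hand for the expression "(foo:prop -foo:rel)+" used in an earlier example; the state numbering, the table `DELTA`, and the function names are assumptions for illustration, and only the forward direction is shown. A reverse DFA would be an analogous table built for the reversed expression.

```python
from typing import Dict, List, Optional, Tuple

# Hand-built DFA for "(foo:prop -foo:rel)+" (hypothetical state numbering).
START = 0
ACCEPTING = {2}
DELTA: Dict[Tuple[int, str], int] = {
    (0, "foo:prop"): 1,
    (1, "-foo:rel"): 2,
    (2, "foo:prop"): 1,   # start another repetition
}


def run(props: List[str], state: int = START) -> Optional[int]:
    """Feed the properties to the DFA; None means the dead state was reached."""
    for prop in props:
        next_state = DELTA.get((state, prop))
        if next_state is None:
            return None
        state = next_state
    return state


def accepts(props: List[str]) -> bool:
    # Full check, applied once a complete path has been found.
    return run(props) in ACCEPTING


def viable_prefix(props: List[str]) -> bool:
    # Partial check during search: a prefix that hits the dead state can be pruned.
    return run(props) is not None


print(accepts(["foo:prop", "-foo:rel"]))
print(accepts(["foo:prop", "-foo:rel", "foo:prop"]))
print(viable_prefix(["foo:prop", "-foo:rel", "foo:prop"]))
print(viable_prefix(["foo:prop", "foo:prop"]))
```

The third and fourth calls show the difference between a prefix that is not yet accepted but can still be extended, and one that can be discarded immediately.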
Experiments: biology pathway
• Biosynthesis paths in biology (glycomics)
  • How is a specific glyco peptide created from a basic structure?
  • Find the pathway between dolichol phosphate and glyco peptide G00009
• The path has 15 reactions (30 hops, as each reaction is represented by its substrates and products)
• Only an undirected path connects the endpoint resources, but a specific directionality pattern is present
[Figure: RDF representation of sample reactions in the path.]
Experiments: biology pathway
• Functionality test: proof of concept
N-glycan biosynthesis pathway
SELECT list(%path)
WHERE { glyco:dolichol_phosphate %path glyco:glyco_peptide_G00009 .
        %path rdfs:member enzyo:R05969
        FILTER ( length(%path) <= 30 &&
                 regex(%path, "((-glyco:has_acceptor_substrate | -glyco:has_reactant) glyco:has_product)*") ) }
Ontology: GlycO
Length: 30 hops
Consists of: 15 reactions
Search time: milliseconds (less than 1 tick)
... courtesy of Dr. Alison Vandersall-Nairn, University of Georgia
Experiments
• Scalability
  • Modified DBLP datasets in RDF (added random citations)
  • Tests on an increasing dataset (adding older years of publications)
  • Search for cited publications (transitive)
PREFIX opus: <http://lsdis.cs.uga.edu/projects/semdis/opus#>
SELECT ?end_publication
WHERE { <http://dblp.uni-trier.de/rec/bibtex/journals/ai/Huber06> %path ?end_publication
        FILTER ( length(%path) <= 26 && regex(%path, "(opus:cites_publication)*") ) }
B. Aleman-Meza et al. Semantic Analytics on Social Networks: Experiences in Addressing the Problem of Conflict of Interest Detection. (WWW2006)
Experiments – results: single source paths
Search for paths up to length 26
More complex uses of path expressions
• Discover connecting paths with a shared node
  • Path between A and B, length up to 4
  • Path between C and D, length up to 4
  • Both paths have a shared resource ?x
  A %path_1 B
  length(%path_1) <= 4
  C %path_2 D
  length(%path_2) <= 4
  %path_1 rdfs:member ?x
  %path_2 rdfs:member ?x
• Potential subgraph discovery
[Figure: nodes A, B, C, and D, with the A–B and C–D paths meeting at a shared node.]
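The shared-node pattern amounts to intersecting the members of two bounded paths. A minimal Python sketch follows, assuming each path is given as a list [p1, r1, p2, ..., pn] and that only intermediate resources are compared; the function names and the sample paths are hypothetical.

```python
from typing import List, Set


def path_length(path: List[str]) -> int:
    # Number of properties in a sequence that starts and ends with a property.
    return (len(path) + 1) // 2


def shared_resources(path_1: List[str], path_2: List[str], max_len: int = 4) -> Set[str]:
    """Resources bound to ?x: members of both paths, if both respect the length bound."""
    if path_length(path_1) > max_len or path_length(path_2) > max_len:
        return set()
    # Resources sit at the even 1-indexed positions, i.e. elements[1::2].
    return set(path_1[1::2]) & set(path_2[1::2])


path_ab = ["p", "x", "q", "m", "p"]   # a path from A to B through x and m
path_cd = ["r", "m", "s"]             # a path from C to D through m

print(shared_resources(path_ab, path_cd))
```

Here the two paths meet at m, which is the kind of binding that would make the pair of paths a candidate subgraph.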
SPARQLeR summary
• Path expressions
  • use of regular expressions over properties
• Flexible path specification
  • Undirected
  • Defined directionality paths
  • Directed
  • Length restricted
• Complex path patterns
  • Test of resources and properties on the path
  • Intersecting paths
Conclusion and future work
• The SPARQLeR extension fits seamlessly into the current SPARQL syntax.
• Performance of path queries is acceptable (if the defined expression is highly selective).
• Optimization of path queries, complex expressions, and multiple paths in a query.
• Inclusion of context.
SPARQLeR
Krys Kochut, Maciej Janik
Thank you