The DBin Semantic Web Platform: An overview G. Tummarello, C. Morbidoni. SeMedia Semantic Web & Multimedia Group Università Politecnica delle Marche, Ancona (Italy) http://semanticweb.deit.univpm.it.
Part 1: a rough introduction to the “Semantic Web”, highly biased toward W3C standards and ideas.
A quick look at the semantic web: The “origins”: Tim Berners-Lee, CERN March 1989 Information Management: A Proposal
The “Web” in the “Semantic Web” Mentioning “web” implies a whole scenario: • Distributed • Machine accessible (probably via HTTP) • Consensus and utility driven, based on standards • Technically imperfect (but live with it – 404!) • Potentially unfriendly (live with it and/or take measures)
.. so “Semantic + Web” • It does not aim at all at solving the “artificial intelligence” problem • It is a common effort driven by the great potential benefits (“a little semantics goes a long way”) • A fertile ground where machine-generated annotations (e.g. classifications, neural networks, classic AI) can merge naturally and in an orderly way with human-generated ones.
Basic instruments • Uniform Resource Identifiers To identify “things” that will be subject, object or predicate of our statements. More than “URLs”, e.g. : A book: ISBN: 1234-123443-1234343 Some bytes: ED2K: 54736fa457 My cat: TAG:giovanni@wup.it/25-12-2004/chicca
RDF – expressing relationships between resources Hypothesis: we can express knowledge as a directed and labeled graph, the basic unit being a STATEMENT (a triple) linking (usually) RESOURCES: Subject Predicate Object Observation: Seems nice and uniform More complex statements can be expressed as aggregations of simpler ones.
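The triple model above can be sketched in a few lines of Python. This is only an illustration: the tuple representation, the `outgoing` helper and the `exterms:name` statement are assumptions made for the example, not part of RDF itself.

```python
from collections import defaultdict

DC = "http://purl.org/dc/elements/1.1/"
PAGE = "http://www.example.org/index.html"
STAFF = "http://www.example.org/staffid/85740"

# A statement is a (subject, predicate, object) tuple; a graph is a set of them.
# The last statement is a made-up addition to show a second node with edges.
graph = {
    (PAGE, DC + "creator", STAFF),
    (PAGE, DC + "language", "en"),
    (STAFF, "http://www.example.org/terms/name", "John Smith"),
}


def outgoing(graph, subject):
    """Return {predicate: set of objects} for the edges leaving `subject`."""
    edges = defaultdict(set)
    for s, p, o in graph:
        if s == subject:
            edges[p].add(o)
    return dict(edges)


print(outgoing(graph, PAGE))
```

A more complex description is simply more tuples in the same set, which is the "aggregation of simpler statements" idea.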
…forget the XML, it’s a graph: <?xml version="1.0"?> <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:exterms="http://www.example.org/terms/"> <rdf:Description rdf:about="http://www.example.org/index.html"> <exterms:creation-date>August 16, 1999</exterms:creation-date> </rdf:Description> <rdf:Description rdf:about="http://www.example.org/index.html"> <dc:language>en</dc:language> </rdf:Description> <rdf:Description rdf:about="http://www.example.org/index.html"> <dc:creator rdf:resource="http://www.example.org/staffid/85740"/> </rdf:Description> </rdf:RDF>
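To see the graph behind the XML, the document above can be flattened into triples. The sketch below uses only the Python standard library and handles just the constructs that appear on this slide (`rdf:Description` with `rdf:about`, and property elements with text or `rdf:resource`). It is not a general RDF/XML parser, and the name `simple_triples` is made up for the example.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

RDF_XML = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:exterms="http://www.example.org/terms/">
  <rdf:Description rdf:about="http://www.example.org/index.html">
    <exterms:creation-date>August 16, 1999</exterms:creation-date>
  </rdf:Description>
  <rdf:Description rdf:about="http://www.example.org/index.html">
    <dc:language>en</dc:language>
  </rdf:Description>
  <rdf:Description rdf:about="http://www.example.org/index.html">
    <dc:creator rdf:resource="http://www.example.org/staffid/85740"/>
  </rdf:Description>
</rdf:RDF>"""


def simple_triples(xml_text):
    """Extract (subject, predicate, object) tuples from a very simple RDF/XML file."""
    root = ET.fromstring(xml_text)
    triples = set()
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        subject = desc.get(f"{{{RDF_NS}}}about")
        for prop in desc:
            # ElementTree tags look like "{namespace}local"; join them into a URI.
            namespace, local = prop.tag[1:].split("}")
            resource = prop.get(f"{{{RDF_NS}}}resource")
            obj = resource if resource is not None else (prop.text or "").strip()
            triples.add((subject, namespace + local, obj))
    return triples


for triple in sorted(simple_triples(RDF_XML)):
    print(triple)
```

The three `rdf:Description` blocks collapse into three edges leaving the same node, which is why the XML layout does not matter.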
Looks powerful already! • If we liked XML (tree structures are a superset of plain (DB) tables) • ...we have got to love RDF (graphs are a superset of trees) • Standard serializability • Actually simpler (no attribute-like things)
Is it enough? A naked, opaque RDF graph example [Figure: an RDF graph whose nodes carry only opaque identifiers: MD5:5465ad435, MD5:565758, MD5:87854435, MD5:543647667, MD5:ab56477, MD5:34699934, MD5:fa345535]
So, to add an ontology: RDFS - OWL • Defines terms and concepts used to describe and represent an area of knowledge, and how they are interrelated • Enables more effective automated information processing • Includes concepts such as Classes, Instances, Relationships, Properties, Functions, Constraints • SW philosophy : no centralization, global accessibility Free to make your own! At your own risk!
RDF/RDFS/OWL a quick view RDF: • Resource • Property • Statement RDFS: • Class • type • domain • range • subClassOf OWL: • Property characteristics • Property restrictions • Mapping features • Complex Classes
OWL Property Characteristics • TransitiveProperty • SymmetricProperty • FunctionalProperty • inverseOf • InverseFunctionalProperty
OWL Property Restrictions • allValuesFrom • someValuesFrom • minCardinality • maxCardinality • cardinality • hasValue (OWL DL)
OWL Mapping Features • equivalentClass • equivalentProperty • sameAs • differentFrom • AllDifferent
OWL Complex Classes • intersectionOf • unionOf (OWL DL) • complementOf (OWL DL) • oneOf (OWL DL) • disjointWith (OWL DL)
OWL Example • Class • Person superclass • Man, Woman subclasses • Properties • isWifeOf, isHusbandOf • Property characteristics, restrictions • inverseOf • domain • range • Cardinality • Class expressions • disjointWith
OWL Example In XML <rdf:RDF xmlns="http://owl.protege.stanford.edu#" xmlns:protege="http://protege.stanford.edu/plugins/owl/protege#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns:owl="http://www.w3.org/2002/07/owl#"> <owl:Ontology rdf:about=""> <owl:imports rdf:resource="http://protege.stanford.edu/plugins/owl/protege"/> </owl:Ontology> <owl:Class rdf:ID="Person"/> <owl:Class rdf:ID="Man"> <rdfs:subClassOf rdf:resource="#Person"/> <owl:disjointWith> <owl:Class rdf:about="#Woman"/> </owl:disjointWith> </owl:Class> <owl:Class rdf:ID="Woman"> <owl:disjointWith rdf:resource="#Man"/> <rdfs:subClassOf rdf:resource="#Person"/> </owl:Class>
OWL Example In XML (cont.) <owl:ObjectProperty rdf:ID="isHusbandOf" rdf:type="http://www.w3.org/2002/07/owl#FunctionalProperty"> <rdfs:domain rdf:resource="#Man"/> <rdfs:range rdf:resource="#Woman"/> <owl:inverseOf rdf:resource="#isWifeOf"/> <owl:minCardinality>0</owl:minCardinality> <owl:maxCardinality>1</owl:maxCardinality> </owl:ObjectProperty> <owl:ObjectProperty rdf:ID="isWifeOf" rdf:type="http://www.w3.org/2002/07/owl#FunctionalProperty"> <rdfs:domain rdf:resource="#Woman"/> <rdfs:range rdf:resource="#Man"/> <owl:inverseOf rdf:resource="#isHusbandOf"/> <owl:minCardinality>0</owl:minCardinality> <owl:maxCardinality>1</owl:maxCardinality> </owl:ObjectProperty> </rdf:RDF>
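What the ontology above buys us can be hinted at with a toy reasoner. The sketch below hard-codes two of the declarations from the example (`inverseOf` and `FunctionalProperty`) and applies them to plain tuples. The individuals `john`, `mary` and `anna` and both function names are hypothetical, and a real OWL reasoner does considerably more.

```python
HUSBAND, WIFE = "isHusbandOf", "isWifeOf"

# Declarations taken from the ontology above.
INVERSE_OF = {HUSBAND: WIFE, WIFE: HUSBAND}
FUNCTIONAL = {HUSBAND, WIFE}


def add_inverses(facts):
    """Return the facts plus every statement entailed by owl:inverseOf."""
    inferred = set(facts)
    for s, p, o in facts:
        if p in INVERSE_OF:
            inferred.add((o, INVERSE_OF[p], s))
    return inferred


def functional_conflicts(facts):
    """Return (subject, property) pairs with several values for a functional property.

    Under OWL semantics this is not automatically an error: it entails that the
    values denote the same individual unless they are declared different.
    The sketch only flags the pairs.
    """
    values = {}
    for s, p, o in facts:
        if p in FUNCTIONAL:
            values.setdefault((s, p), set()).add(o)
    return {key for key, objs in values.items() if len(objs) > 1}


facts = {("john", HUSBAND, "mary")}
print(add_inverses(facts))
print(functional_conflicts(facts | {("john", HUSBAND, "anna")}))
```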
Part 2 Exploring a novel scenario: Semantic Web P2P “a la Napster”
SW P2P a la Napster, scenario and possibilities (1) File sharing, P2P “philosophy”: • Downloads a lot; downloading too much is not a problem • Shares what it has downloaded • Uncommitted, no guarantees, join and leave at will • Works mostly unattended and uses background resources • Deals primarily with non-time-critical use cases But for the Semantic Web: • Exchanges, downloads and serves “metadata” rather than “files” • Searches by user interests rather than file names (“Sicilian cuisine”, “Scottish pubs”) • Remembers (almost) everything: grows a local triplestore
SW P2P a la Napster, scenario and possibilities (2) Storing a lot of metadata locally... why? 1) Why not? Disk space is very cheap (and it's just metadata) 2) The more we know, the smarter and more independent we become 3) Key enabler of global scalability: we “use the Semantic Web” without direct network traffic or external computational burden, and high replication is key to decentralized operation 4) Maximally fast and interactive 5) Puts your local CPUs to work: much more powerful than what a server can give you for free, it allows sophisticated information processing (reasoning, filters), can run personalized algorithms for rating and trust, and can relate to and integrate directly with local resources (SW desktop integration)
Worth a try..Getting busy Troubling questions: How would it really work in detail? Would it be appealing to real users? Or ..just a dead boring geek tool?
A lot of pragmatic decisions A complete Semantic Web application today means… • Deliverable integration platform • Domain application/GUI • Trust policies tools • Data flow pipeline • RDF signing methodologies • Ontology import policies • RDF P2P transport layer • URL data handling (up/down) • URI minting • RDF storage
URL Data Handling and URI Minting
The P2P infrastructure will deal with RDF, but: • People want to access pictures, MP3s and files, not just see URIs, so URL resolving/downloading is needed • Automated uploading is also needed!
RDF Storage
RDF/S Storage • Many choices! • We chose Sesame (SeRQL was schema aware long ago) • Thanks Sesame guys! New features being added.. (See trust filtering, pipelining)
RDF P2P Transport Layer
Distributed RDF, inherent scalability issues WWW: “Accessing” w3c.org • I know exactly whom to ask • Network traffic = the size of the document • Computational complexity: negligible SW: “Something about” w3c.org • Many parties will have something to say: finding them, distributing the query and collecting the results all cost traffic • Even worse: join queries demand that the data be moved completely (or need an index previously built in the same way) • Might be very and/or unpredictably computationally intensive for the parties involved
One size (P2P model) doesn’t fit all Several SW P2P approaches have been proposed: • Centralized + Crawlers/feeds • Distributed queries (Edutella et al.) • Distributed RDF storage (RDFPeers) But they do not fit the desired scenario
Centralized + Crawlers/Feeds • Heavyweight centralized • Query limitations • Crawlers have shortcomings • Free as in speech? • Central control might however work wonders for certain applications! (see Froogle)
Distributing queries/collecting results “Edutella approach” – Interlibrary scenario: • Queries intrinsically limited • Scalability is to be questioned • But might be the only approach to many common scenarios!
Distributed RDF storage Hashed triples (“RDFPeers”): • Can address huge graphs! • Query limitations (no schema!) vs very high network traffic • Traffic and load on a single peer as a function of the number of active participants
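The load concern in the last bullet can be made concrete with a toy placement function. The sketch assumes each triple is indexed once per component (subject, predicate, object), which is how hashed-triple stores of this kind are usually described. The modulo placement stands in for a real DHT ring, and the function names are invented for the example.

```python
import hashlib


def responsible_peer(key, peer_count):
    """Map a key to a peer index; a stand-in for a real DHT lookup."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % peer_count


def placements(triple, peer_count):
    """Return the peer that stores the triple under each of its three components."""
    roles = ("subject", "predicate", "object")
    return {role: responsible_peer(value, peer_count) for role, value in zip(roles, triple)}


a = placements(("ex:guinness", "rdf:type", "beer:Beer"), 16)
b = placements(("ex:peroni", "rdf:type", "beer:Beer"), 16)
print(a)
print(b)
```

Every triple that uses a popular predicate such as `rdf:type` lands on the same peer for its predicate index, which is one way a single peer ends up carrying a disproportionate share of traffic.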
“RDFGrowth” - Design essentials What can’t we expect others to do? • Execute arbitrary external graph queries • Perform an active “information hunt” for us: no replicating queries, no query forwarding or routing. In general, no operations that induce a non-constant burden • Provide a service in anything but a purely “best effort” fashion: no uptime guarantees, no service guarantees
Designing a solution: what do we have plenty of? • Storage space! 200,000 MB = $102 (WD 200GB 2000BB UATA100 2MB 7200RPM @ MWave.com, 20/8/2004) • So we will store all the metadata we get to know in the internal RAW database (at least “about” the URIs related to subjects we express interest in). It's RDF, it's monotonic.
RDFGrowth is based on: • A way to define the “resources” (URIs) of interest: Group URI Exposing Definitions (GUEDs) • Given a resource (URI), an operator to extract a “small graph surrounding” it: the RDF “Neighbours” function (RDFN) • A way to create RDF hash values: RDF Canonical Serialization
Selecting groups of URIs to “talk about”: GUEDs Group URI Exposing Definition: An operator defining subset of URIs Example: Select x where {x} <rdf:type> {<beer:Beer>}
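In code, a GUED is just a function from a graph to a set of URIs. The sketch below mirrors the example query over an in-memory set of tuples. The `gued_by_type` name and the beer and pub URIs are made up for illustration.

```python
RDF_TYPE = "rdf:type"


def gued_by_type(graph, class_uri):
    """Select every x such that the graph contains {x} rdf:type {class_uri}."""
    return {s for s, p, o in graph if p == RDF_TYPE and o == class_uri}


graph = {
    ("beer:Guinness", RDF_TYPE, "beer:Beer"),
    ("beer:Peroni", RDF_TYPE, "beer:Beer"),
    ("pub:TheCrown", RDF_TYPE, "pub:Pub"),
    ("pub:TheCrown", "pub:serves", "beer:Guinness"),
}

print(sorted(gued_by_type(graph, "beer:Beer")))
```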
KEY CONCEPT: Minimum Self-contained Graph (MSG) Involves (def): an RDF statement involves a name if it has that name as subject or object. MSG (def): given an RDF statement s, the Minimum Self-contained Graph (MSG) containing that statement, written MSG(s), is the set of RDF statements comprising the following: • The statement in question • Recursively, for all the blank nodes involved by statements included in the description so far, the MSG of all the statements involving such blank nodes
Information “Surrounding” a URI: RDF “Neighbours” MSG(statement) (approx. def): the “blank node closure” of the statement. RDFN (def): the RDFN of a resource is the graph composed of all the MSGs involving the resource itself. It is similar to the Concise Bounded Resource Description (CBRD) given in [URIQA], but differs mainly in its use of “involves”. RDFN(URI) is the only remote query allowed in RDFGrowth.
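The MSG and RDFN definitions translate almost literally into code. The sketch below assumes statements are tuples and blank nodes are strings starting with `_:`. The FOAF-style sample data is invented for the example.

```python
def is_blank(node):
    """Blank nodes are written "_:label" in this sketch (an assumption)."""
    return isinstance(node, str) and node.startswith("_:")


def involves(statement, name):
    """A statement involves a name if the name is its subject or its object."""
    return statement[0] == name or statement[2] == name


def msg(graph, statement):
    """Minimum Self-contained Graph: the statement plus its blank node closure."""
    result = set()
    pending = [statement]
    while pending:
        current = pending.pop()
        if current in result:
            continue
        result.add(current)
        for node in (current[0], current[2]):
            if is_blank(node):
                pending.extend(t for t in graph if involves(t, node) and t not in result)
    return result


def rdfn(graph, uri):
    """RDF Neighbours: the union of the MSGs of all statements involving `uri`."""
    neighbours = set()
    for statement in graph:
        if involves(statement, uri):
            neighbours |= msg(graph, statement)
    return neighbours


graph = {
    ("ex:giovanni", "foaf:name", "Giovanni"),
    ("ex:giovanni", "foaf:knows", "_:b1"),
    ("_:b1", "foaf:name", "Christian"),
    ("_:b1", "foaf:mbox", "mailto:c@example.org"),
}

print(len(msg(graph, ("ex:giovanni", "foaf:name", "Giovanni"))))
print(len(rdfn(graph, "ex:giovanni")))
```

A statement without blank nodes is its own MSG, while a statement touching `_:b1` drags in everything said about that blank node, so the receiver never gets a dangling anonymous resource.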
Locating “News”: RDFN Hash Set RHS(URI) = Hashes(canonicalize(RDFN(URI))) • Concise values exposed to the network to represent the knowledge a peer has about a URI • Peers looking for information about a URI use the published RHS to select whom to talk to (i.e. the most “interesting” peer) • As long as RDFN1 = RDFN2 gives RHS1 = RHS2, the algorithm will converge anyway, so a simple MD5 does it • ...but adding other heuristics increases the speed of convergence
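One possible reading of the RHS is one MD5 digest per MSG. The sketch below follows that reading under two loud assumptions: canonicalization is reduced to sorting serialized statements, which ignores the blank node relabeling that a real canonical serialization must handle, and "most interesting" is taken to mean "advertises the most hashes we lack". Neither is confirmed by the slides.

```python
import hashlib


def canonicalize(statements):
    """Toy canonical form: sorted, newline-joined statement lines."""
    return "\n".join(sorted(f"{s} {p} {o} ." for s, p, o in statements))


def rhs(msgs):
    """Hash set for a list of MSGs: one MD5 hex digest per MSG."""
    return {hashlib.md5(canonicalize(m).encode("utf-8")).hexdigest() for m in msgs}


def most_interesting(local_rhs, published):
    """Pick the peer advertising the most unknown hashes, or None if nobody has news.

    `published` maps a peer name to the RHS that peer exposed for the URI.
    """
    best = max(published, key=lambda peer: len(published[peer] - local_rhs), default=None)
    if best is None or not (published[best] - local_rhs):
        return None
    return best


m1 = {("ex:a", "ex:p", "ex:b")}
m2 = {("ex:a", "ex:q", "1"), ("ex:a", "ex:r", "2")}
mine = rhs([m1])
print(most_interesting(mine, {"peerA": rhs([m1]), "peerB": rhs([m1, m2])}))
```

Because equal MSGs always hash to equal values, two peers with the same knowledge about a URI publish the same set and stop exchanging data.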
KEL – Knowledge Exchange Layer An abstract “transport driver” implementing the following interface: • PUBLISH(URI, RHS) • PeersAndRHS[] = LOOKUP(URI) • RDF = getRDFN(peer, URI) Possible implementations: • DHT P2P networks (ideal!) • Server-based systems (Jabber.. JaSiMPa) • Newsgroups, mailing lists, Freenet (?), etc.
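The three-call interface maps naturally onto an abstract class. The sketch below is one hypothetical rendering in Python, together with a single-process stand-in for the network that is useful only for experiments. The class names and the `rdfn_provider` callback are assumptions, not DBin's actual API.

```python
from abc import ABC, abstractmethod


class KnowledgeExchangeLayer(ABC):
    """Abstract transport driver with the three operations listed above."""

    @abstractmethod
    def publish(self, uri, rhs):
        """Expose our hash set for `uri` to the network."""

    @abstractmethod
    def lookup(self, uri):
        """Return {peer: rhs} for the other peers that published about `uri`."""

    @abstractmethod
    def get_rdfn(self, peer, uri):
        """Ask `peer` for its RDFN of `uri`."""


class InMemoryNetwork:
    """Toy replacement for a DHT: shared dictionaries inside one process."""

    def __init__(self):
        self.index = {}      # uri -> {peer_id: rhs}
        self.providers = {}  # peer_id -> callable(uri) returning a set of triples


class InMemoryKEL(KnowledgeExchangeLayer):
    def __init__(self, peer_id, network, rdfn_provider):
        self.peer_id = peer_id
        self.network = network
        network.providers[peer_id] = rdfn_provider

    def publish(self, uri, rhs):
        self.network.index.setdefault(uri, {})[self.peer_id] = frozenset(rhs)

    def lookup(self, uri):
        published = self.network.index.get(uri, {})
        return {peer: rhs for peer, rhs in published.items() if peer != self.peer_id}

    def get_rdfn(self, peer, uri):
        return set(self.network.providers[peer](uri))


net = InMemoryNetwork()
alice = InMemoryKEL("alice", net, lambda uri: {(uri, "ex:p", "ex:o")})
bob = InMemoryKEL("bob", net, lambda uri: set())
alice.publish("ex:beer", {"h1"})
print(bob.lookup("ex:beer"))
```

Keeping the interface this small is what lets very different transports (a DHT, a server, a mailing list) sit behind it.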
A glimpse at the algorithm For each URI matched by the GUED: • Look up the remote RHSs • Request the RDFN from the “most interesting” peer, merge it, recalculate the local RHS, and repeat while there is still new knowledge about that URI • [optional] If you are reasonably sure you have “news”, issue a “broadcast”
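The loop for a single URI can be sketched as below. The transport and hashing are passed in as plain callables so the block stays self-contained. The parameter names, the `max_rounds` guard and the "largest number of unknown hashes" selection rule are assumptions, and the optional broadcast step is left out.

```python
def grow(uri, local_graph, lookup, get_rdfn, compute_rhs, max_rounds=100):
    """Merge remote knowledge about `uri` into `local_graph` until nothing is new.

    lookup(uri)             -> {peer: published hash set}
    get_rdfn(peer, uri)     -> set of triples (that peer's RDFN for the URI)
    compute_rhs(graph, uri) -> our own hash set for the URI
    """
    for _ in range(max_rounds):
        mine = compute_rhs(local_graph, uri)
        # Keep only the peers that advertise at least one hash we do not have.
        news = {peer: rhs - mine for peer, rhs in lookup(uri).items() if rhs - mine}
        if not news:
            break
        best = max(news, key=lambda peer: len(news[peer]))
        size_before = len(local_graph)
        local_graph |= get_rdfn(best, uri)  # RDF is monotonic: merging is a set union
        if len(local_graph) == size_before:
            break  # the peer advertised news but delivered nothing; stop looping
    return local_graph
```

Each round issues one lookup and one RDFN request, so the work a peer asks of any other peer stays bounded, in line with the "no non-constant burden" design rule.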
The compromise, at a single glance: We trade: • Disk space • Processor power • Real-time behavior • Determinism in knowledge update delays • Startup time (unless advanced techniques are employed) We expect: • To be able to browse fast and as much as we want • To be able to eventually “get to know” • To be able to eventually “reach our audience” • Our local computational burden to remain constant, or to increase only modestly, as a function of the number of peers.