COMPSCI 732: Semantic Web Technologies

Semantic Web Architecture

Where are we?

Overview • Introduction and motivation • Technical solutions • Semantic Web architecture • Uniform Resource Identifier • eXtensible Markup Language (XML) • XML Schema • Namespaces • Extensions • Illustration by a large example • Summary • References

INTRODUCTION AND MOTIVATION

A Semantic Web Scenario From Today • Queries: • Which type of music is played by UK radio stations? • Which UK radio station is playing titles by Swedish composers? • Information to answer these queries is available on the Web • Web search engines analyze Web content one page at a time • The Semantic Web provides a better framework to answer such queries • combines data • distributed across different sources, and • described in a machine-interpretable manner

Steps in Answering Queries • Playlists of BBC radio shows published online in Semantic Web formats • Music groups such as “ABBA” have an identifier: http://www.bbc.co.uk/music/artists/d87e52c5-bb8d-4da8-b941-9f4928627dc8#artist • Identifier can relate music group to information at MusicBrainz • Music community portal exposing data on Semantic Web • http://musicbrainz.org • Knows about band members (e.g. Benny Andersson) • Aligns its information with Wikipedia • Information on UK radio stations may be found in lists on Web pages • Can be translated into similar Semantic Web representation

Describing Things and Their Relationships • Meaning of Relationships, e.g., band memberships explained online, too • Using collections of Ontologies available on the Web • Dublin Core (general properties of information resources) http://dublincore.org/ • SKOS (covering taxonomic descriptions) http://www.w3.org/2004/02/skos/ • Specialized ontologies (covering the music domain) • Data at the BBC currently use at least nine different ontologies • http://www.bbc.co.uk/ontologies/programmes • Availability of data in these formats enables queries to be answered • Based on a query language

Towards the Required Infrastructure • What infrastructure is required to implement the scenario from before? • Generic software components, Languages, Protocols • Their seamless interaction to satisfy requests • Purpose of Lecture: • Investigate Semantic Web Architecture • Analyze requirements from technical need to identify and relate data • Analyze organizational needs to maintain Semantic Web as a whole

Web Architecture • The Semantic Web is an evolution of the Web • Important for the fast growth and adoption of the Web are • Many people can set up Web servers easily and independently from each other • More people can create documents, put them online, and link them to each other • Even more people can browse and access any Web server to retrieve documents • Web architecture allows graceful degradation of user experience when: • Network is partially slow (World Wide Wait), while other parts still operate at full speed • Single Web servers break, because others still work • Hyperlinks are broken, because other links still lead somewhere • Separation of concerns justifies less quality outputs • Users can easily create and access documents • Distributed nature of system, without need of central coordinator, results in robustness

Web Architecture Principles • Explicit simple data representation • Common data representation hides underlying technologies (e.g. HTML) • Distributed system • Data sources without centralized instance controlling who owns what type of info • Distributed ownership and control can facilitate adoption and scalability • E.g. Web pages are under full control of their producers • Cross-referencing • Reuse of existing data and data definitions from different authorities (e.g. hyperlinks) • Loose coupling achieved by common language layers • Communication in standardized languages • These must be easy to customize • Overall communication must not be jeopardized by such specialization • E.g. Coupling of Web clients/servers: HTTP for transport, HTML for Web content • Ease of publishing and consumption • Easy publishing and consumption of simple data • Comprehensive publishing and consumption of complex data, e.g. HTML is simple enough to convey textual info, while powerful browsers and content management systems handle complex content

Semantic Web Requirements and Examples • Must be able to represent entities and their relationships (1) • A person, the birthday of a person, the name of a person (“Benny Andersson”) • Must be serializable in standardized manner to easily exchange data between different computing nodes (1,2,4) • Ease of joining information from MusicBrainz, BBC, DBpedia • Entities must be referable across borders of ownership or computing systems to allow for cross-linking of data (1,2,3,4) • ABBA’s Benny Andersson becomes hard to distinguish from other Benny Anderssons • Expressive, machine-understandable data description language (1,4,5) • Manual inspection not scalable; refinements of basic model impossible • BBC Data involves radio stations, shows, their versions, songs and their artists • A query and manipulation language to select and aggregate data (5) • The number of Swedish composers being broadcast on a specific program • Reasoning desirable to facilitate querying (5) • Direct relationship between a program and a song using inference • Transport of data and query and their results by agreed-upon protocols (HTTP) • May involve encrypted data requests and transports (HTTPS); signature of data items to ensure authenticity of user requests and control access to resources

Additional Requirements • Core requirements not yet included in language architecture • Versatile means for user interaction • Broad accessibility requires viewing, searching, browsing, querying of data • While at the same time abstracting from intricacies underlying their distributed origin • On-the-fly data integration of multiple data sources: assemble information from multitude of sources without a priori knowledge about domain or structure of data • Facilitation of data production and publishing: metadata creation and migration of data must be made convenient, independent from origin of data • Provenance and Trust • Authorship and ownership get lost during data processing and aggregation • Origin, Reliability, Trustworthiness must be rethought to apply them for individual and aggregated data items, to establish faithful authentication at Semantic Web scale • Alignment of unconnected sets of data • Interlinking implies capability to suggest alignments between identifiers or concepts from different sets of data, beyond mere use of identifiers such as URI/IRIs • Such alignment may be necessary to enable a real Web of Data

Semantic Web Architecture • Formalized components and their relationships • What technologies make up the Semantic Web • What are the dependencies between components • Roadmap for steps of developing the Semantic Web

The Semantic Web architecture and its foundations TECHNICAL SOLUTION

Search and Query the Web I • The Web is a constantly growing network of distributed resources • More than 1 trillion unique URLs • More than 100 billion pages • More than 200 million web sites • Check most updated data on: http://news.netcraft.com/archives/web_server_survey.html • User needs to be able to efficiently search resources/content over the Web • When I Google “Milan” do I find info on the city or the soccer team? • User needs to be able to perform query over largely distributed resources • When is the next performance of the rock band “U2”, where it will be located, what are the best ways to reach the location, what are the attractions nearby…

Search and Query the Web II • On2Broker is the evolution of Ontobroker, a system that aims to solve the problems discussed in the previous slides by adopting Semantic Technologies • On2Broker is a system that processes distributed information sources and provides intelligent information retrieval and query answering • On2Broker relies on components of the Semantic Web Architecture [D. Fensel, S. Decker, M. Erdmann, R. Studer: Ontobroker in a Nutshell. ECDL 1998: 663-664]

On2Broker: Architecture

On2Broker Components I • Query Interface • Provides a structured input that enables users to define their queries without any knowledge of the query language • Input queries are then transformed to the query language (e.g. SPARQL) • Repository • Decouples query answering, information retrieval and reasoning • Provides support for materialization of inferred knowledge

On2Broker Components II • Crawlers and Wrappers (or Info Agent) • Extract knowledge from different distributed and heterogeneous data sources • RDFa pages and RDF repositories can be included directly • HTML and XML data sources require processing by wrappers to derive RDF data • Inference Engine • Relies on knowledge imported from the crawlers and axioms contained in the repository to support query answers • Adopts Horn logic and closed world assumption

On2Broker: Example Tim Berners-Lee knows Christian Bizer and Tom Heath 1. Whom does Tim Berners-Lee know? 2. SELECT DISTINCT ?s ?o WHERE { ?s foaf:knows ?o . } … • Extract RDF from: http://www.w3.org/People/Berners-Lee/dblp… • Extract RDF from: fensel.comdblp… • Extends KB:if “x dblp:coauthor y“ then “x foaf:knows y” • if “y foaf:knows x“ then “x foaf:knows y”
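The two rules that extend the knowledge base can be read as forward-chaining rules over a set of triples. The following is only a minimal sketch of that idea, not On2Broker's actual inference engine; the function name `infer_knows` and the toy input triples are assumptions made for illustration.

```python
def infer_knows(triples):
    """Apply the two rules from the slide until no new triple is derived."""
    kb = set(triples)
    changed = True
    while changed:
        changed = False
        new = set()
        for s, p, o in kb:
            if p == "dblp:coauthor":
                new.add((s, "foaf:knows", o))   # rule 1: coauthor implies knows
            elif p == "foaf:knows":
                new.add((o, "foaf:knows", s))   # rule 2: knows is symmetric
        if not new <= kb:
            kb |= new
            changed = True
    return kb


# Toy input (illustrative only): one fact per rule.
kb = infer_knows({
    ("Tim Berners-Lee", "dblp:coauthor", "Christian Bizer"),
    ("Tom Heath", "foaf:knows", "Tim Berners-Lee"),
})

# "Whom does Tim Berners-Lee know?"
known = sorted(o for s, p, o in kb
               if s == "Tim Berners-Lee" and p == "foaf:knows")
print(known)
```

Neither input triple states directly whom Tim Berners-Lee knows; both answers are obtained only through the rules, which is the point of materializing inferred knowledge in the repository.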

SemWeb Architecture: Requirements • Extensibility • Each layer should extend the previous one(s) • Support for data interchange • Using data from one source in other applications • Support for ontology description with different complexity • Including rules • Support for data query • Support for data provenance and trust evaluation see the Semantic Web Roadmap: http://www.w3.org/DesignIssues/Semantic.html

Semantic Web Stack Rules: RIF Adapted from http://en.wikipedia.org/wiki/Semantic_Web_Stack

UNICODE, URI and XML • UNICODE is the standard international character set • E.g. used to encode the data in the repository • Uniform Resource Identifiers (URIs) identify things and concepts • E.g. used to identify resources on the Web and in the repository • Take care to distinguish between information and non-information resources • http://www.bbc.co.uk/music/artists/d87e52c5-bb8d-4da8-b941-9f4928627dc8#artist vs. http://dbpedia.org/resource/ABBA • Data publishers on the Semantic Web use Linked Data principles: • Use URIs as names for things • Use HTTP URIs so that people can look up those names • When someone looks up a URI, provide useful information, using standards (RDF, SPARQL) • Include links to other URIs, so that they can discover more things • eXtensible Markup Language (XML) used for data exchange • Used on the Semantic Web to exchange the description of resources • E.g. format that can be transformed into RDF and imported into the repository
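For a hash URI such as the BBC identifier, a client drops the fragment before sending the HTTP request, so the URI of the thing (the band) and the URI of the document describing it differ. The sketch below only prepares such a lookup and sends nothing over the network; the `Accept` value is an assumption about what a Linked Data client might ask for, and whether the server still answers with RDF is not checked here.

```python
from urllib.parse import urldefrag
from urllib.request import Request

thing = ("http://www.bbc.co.uk/music/artists/"
         "d87e52c5-bb8d-4da8-b941-9f4928627dc8#artist")

# The fragment stays on the client; only the document URI is requested.
document, fragment = urldefrag(thing)

# Prepare (but do not send) a lookup that asks for an RDF representation.
request = Request(document, headers={"Accept": "application/rdf+xml"})

print(document)
print(fragment)
print(request.get_header("Accept"))
```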

RDF, RDFS and OWL • Resource Description Framework (RDF) • is the HTML of the Semantic Web • Simple way to describe resources on the Web • Based on triples <subject, predicate, object> • Various serializations, including one based on XML • A simple ontology language (RDFS) • E.g. language used to store the data in the repository • More in lecture 3 • Web Ontology Language (OWL) • Is a more complex ontology language than RDFS • Layered language based on Description Logics • Overcomes some RDF(S) limitations • E.g. ontology language used to define the schemas used in repository • More in lecture 7

RDF Graph Encoding a Description of ABBA

RDF Serialized in RDF/XML
<?xml version="1.0"?>
<!DOCTYPE rdf:RDF [
  <!ENTITY bbca "http://www.bbc.co.uk/music/artists/">
  <!ENTITY bbci "http://www.bbc.co.uk/music/images/artists/">
  <!ENTITY mba "http://musicbrainz.org/artist/">
]>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:owl="http://www.w3.org/2002/07/owl#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/"
         xmlns:mo="http://purl.org/ontology/mo/">
  <mo:MusicArtist rdf:about="http://www.bbc.co.uk/music/artists/d87e52c5-bb8d-4da8-b941-9f4928627dc8#artist">
    <rdf:type rdf:resource="http://purl.org/ontology/mo/MusicGroup"/>
    <foaf:name>ABBA</foaf:name>
    <foaf:homepage rdf:resource="http://www.abbasite.com/"/>
    <mo:image rdf:resource="&bbci;542x305/d87e52c5-bb8d-4da8-b941-9f4928627dc8.jpg"/>
    <mo:member rdf:resource="&bbca;042c35d3-0756-4804-b2c2-be57a683efa2#artist"/>
    <mo:member rdf:resource="&bbca;2f031686-3f01-4f33-a4fc-fb3944532efa#artist"/>
    <mo:member rdf:resource="&bbca;aebbb417-0d18-4fec-a2e2-ce9663d1fa7e#artist"/>
    <mo:member rdf:resource="&bbca;ffb77292-9712-4d03-94aa-bdb1d4771d38#artist"/>
    <mo:musicbrainz rdf:resource="&mba;d87e52c5-bb8d-4da8-b941-9f4928627dc8.html"/>
    <mo:wikipedia rdf:resource="http://en.wikipedia.org/wiki/ABBA"/>
    <owl:sameAs rdf:resource="http://dbpedia.org/resource/ABBA"/>
  </mo:MusicArtist>
</rdf:RDF>

RDF Serialized in Turtle
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix mo: <http://purl.org/ontology/mo/> .

<http://www.bbc.co.uk/music/artists/d87e52c5-bb8d-4da8-b941-9f4928627dc8#artist>
    rdf:type mo:MusicArtist, mo:MusicGroup ;
    foaf:name "ABBA" ;
    foaf:homepage <http://www.abbasite.com/> ;
    mo:image <http://www.bbc.co.uk/music/images/artists/542x305/d87e52c5-bb8d-4da8-b941-9f4928627dc8.jpg> ;
    mo:member <http://www.bbc.co.uk/music/artists/042c35d3-0756-4804-b2c2-be57a683efa2#artist> ,
              <http://www.bbc.co.uk/music/artists/2f031686-3f01-4f33-a4fc-fb3944532efa#artist> ,
              <http://www.bbc.co.uk/music/artists/aebbb417-0d18-4fec-a2e2-ce9663d1fa7e#artist> ,
              <http://www.bbc.co.uk/music/artists/ffb77292-9712-4d03-94aa-bdb1d4771d38#artist> ;
    mo:musicbrainz <http://musicbrainz.org/artist/d87e52c5-bb8d-4da8-b941-9f4928627dc8.html> ;
    mo:wikipedia <http://en.wikipedia.org/wiki/ABBA> ;
    owl:sameAs <http://dbpedia.org/resource/ABBA> .

RDFS and OWL Example • Reasoning example in RDFS • rdfs:subClassOf can model class hierarchies • mo:MusicGroup and mo:MusicArtist specify two classes • Axiom <mo:MusicGroup, rdfs:subClassOf, mo:MusicArtist> • Stating that ABBA is an instance of type MusicGroup enables reasoners to conclude that ABBA is also an instance of type MusicArtist • When a query asks for all MusicArtists, ABBA will be contained in the query result, even though there is no explicit assertion of this • Reasoning example in OWL • owl:sameAs can be used to specify that two resources are identical • To consolidate information about ABBA from multiple sources we can specify that http://www.bbc.co.uk/music/artists/d87e52c5-bb8d-4da8-b941-9f4928627dc8#artist and http://dbpedia.org/resource/ABBA are the same
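Both reasoning examples can be imitated with a small fixpoint computation over triples. This is a deliberately simplified sketch, not a complete RDFS or OWL reasoner: it handles only the subclass rule for rdf:type and a reduced reading of owl:sameAs (symmetry, plus copying statements from one subject to the other). The function name `closure` is an assumption.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"
SAME_AS = "owl:sameAs"

BBC_ABBA = ("http://www.bbc.co.uk/music/artists/"
            "d87e52c5-bb8d-4da8-b941-9f4928627dc8#artist")
DBPEDIA_ABBA = "http://dbpedia.org/resource/ABBA"

asserted = {
    ("mo:MusicGroup", SUBCLASS_OF, "mo:MusicArtist"),
    (BBC_ABBA, RDF_TYPE, "mo:MusicGroup"),
    (BBC_ABBA, SAME_AS, DBPEDIA_ABBA),
}


def closure(triples):
    """Apply the two inference rules until nothing new can be derived."""
    kb = set(triples)
    while True:
        derived = set()
        for s, p, o in kb:
            if p == RDF_TYPE:
                # Subclass rule: an instance of a class is also an
                # instance of each of its direct superclasses.
                derived |= {(s, RDF_TYPE, sup) for c, q, sup in kb
                            if q == SUBCLASS_OF and c == o}
            elif p == SAME_AS:
                derived.add((o, SAME_AS, s))   # sameAs is symmetric
                # Simplification: copy statements about s over to o.
                derived |= {(o, q, x) for a, q, x in kb
                            if a == s and q != SAME_AS}
        if derived <= kb:
            return kb
        kb |= derived


kb = closure(asserted)
artists = {s for s, p, o in kb if p == RDF_TYPE and o == "mo:MusicArtist"}
print(artists)
```

A query for all instances of mo:MusicArtist now returns both identifiers for ABBA, although neither was asserted to be a MusicArtist.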

SPARQL and Rule Languages • SPARQL • Query language for RDF triples • A protocol for querying RDF data over the Web • E.g. language used to query the repository from the user interface • Can also be used for Updates • More in lecture 6 • Rule languages (esp. Rule Interchange Format RIF) • W3C recommendation for exchanging rule sets between rule engines • Extend ontology languages with proprietary axioms • Based on different types of logics • Description Logic • Logic Programming • E.g. used to enable reasoning over data to infer new knowledge • More in lecture 8

SPARQL Example • SPARQL query for other music groups that members of ABBA sing in
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX mo: <http://purl.org/ontology/mo/>
SELECT ?memberName ?groupName
WHERE {
  <http://www.bbc.co.uk/music/artists/d87e52c5-bb8d-4da8-b941-9f4928627dc8#artist> mo:member ?m .
  ?x mo:member ?m .
  ?x rdf:type mo:MusicGroup .
  ?m foaf:name ?memberName .
  ?x foaf:name ?groupName .
  FILTER (?groupName <> "ABBA")
}

SPARQL Example • SPARQL query for other music groups that members of ABBA sing in • Graphical representation of WHERE clause • <http://www.bbc.co.uk/music/artists/d87e52c5-bb8d-4da8-b941-9f4928627dc8#artist> mo:member ?m . • ?x mo:member ?m . • ?x rdf:type mo:MusicGroup . • ?m foaf:name ?memberName . • ?x foaf:name ?groupName
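The WHERE clause is a basic graph pattern: each triple pattern is matched against the data, and variables shared between patterns must receive the same value. The sketch below shows that join in plain Python. It is not how a SPARQL engine is implemented; the `ex:` identifiers, the helper names `match` and `solve`, and the small data set are all assumptions for illustration.

```python
# Toy data with shortened, hypothetical identifiers.
TRIPLES = [
    ("ex:abba", "rdf:type", "mo:MusicGroup"),
    ("ex:abba", "foaf:name", "ABBA"),
    ("ex:abba", "mo:member", "ex:benny"),
    ("ex:hepstars", "rdf:type", "mo:MusicGroup"),
    ("ex:hepstars", "foaf:name", "Hep Stars"),
    ("ex:hepstars", "mo:member", "ex:benny"),
    ("ex:benny", "foaf:name", "Benny Andersson"),
]

# The five triple patterns of the WHERE clause; "?..." marks a variable.
WHERE = [
    ("ex:abba", "mo:member", "?m"),
    ("?x", "mo:member", "?m"),
    ("?x", "rdf:type", "mo:MusicGroup"),
    ("?m", "foaf:name", "?memberName"),
    ("?x", "foaf:name", "?groupName"),
]


def match(pattern, triple, binding):
    """Extend `binding` so that `pattern` equals `triple`; None on a clash."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):                       # variable
            if extended.setdefault(term, value) != value:
                return None                            # already bound differently
        elif term != value:                            # constant must match
            return None
    return extended


def solve(patterns, triples):
    """Join all triple patterns, keeping only consistent variable bindings."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in triples:
                extended = match(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


# SELECT ?memberName ?groupName ... FILTER (?groupName <> "ABBA")
rows = sorted((b["?memberName"], b["?groupName"])
              for b in solve(WHERE, TRIPLES)
              if b["?groupName"] != "ABBA")
print(rows)
```

Without the filter, ABBA itself would also appear as a group for each member, because `?x` is free to bind to ABBA's own identifier.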

Two RIF rules for mapping FOAF predicates • True statements in antecedent of rule mean true statements in its conclusion
if   { ?x foaf:firstName ?first;
          foaf:surname ?last }
then { ?x foaf:family_name ?last;
          foaf:givenname ?first;
          foaf:name func:string-join(?first " " ?last) }

if   { ?x foaf:name ?name } and pred:contains(?name, " ")
then { ?x foaf:firstName func:substring-before(?name, " ");
          foaf:surname func:substring-after(?name, " ") }
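Read procedurally, the first rule joins two name parts into a full name and the second splits a full name at its first space. A rough Python equivalent is sketched below; the function names are hypothetical and the dictionaries merely stand in for the derived triples about `?x`.

```python
def join_name(first, last):
    """First rule: derive family_name, givenname and name from the two parts."""
    return {
        "foaf:family_name": last,
        "foaf:givenname": first,
        "foaf:name": " ".join([first, last]),
    }


def split_name(name):
    """Second rule: applies only if the name contains a space."""
    if " " not in name:
        return None                        # condition of the rule not met
    first, _, last = name.partition(" ")   # text before / after the first space
    return {"foaf:firstName": first, "foaf:surname": last}


joined = join_name("Benny", "Andersson")
split = split_name("Benny Andersson")
print(joined)
print(split)
```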

Logics, Proof and Trust • Security and Encryption • HTTPS provides data integrity and confidentiality when transmitting data and queries • Digital signing of RDF graphs provides authenticity and non-repudiation • Unifying logic • Bring together the various ontology and rule languages • Connect unlinked data to provide more meaning to data, and drive data integration • E.g. identity management and alignment via http://sameas.org • Proof • Explanation of inference results, data provenance • Trust • Trust that the system performs correctly • Trust that the system can explain what it is doing • Network of trust for data sources and services • Technology and user interface • Many open problems, topics for future research

Foundations Rules: RIF

More than a-z, A-Z UNICODE

Character Sets • ASCII: 7 bit, 128 characters (a-z, A-Z, 0-9, punctuation) • Extension code pages: 128 additional chars (ß, Ä, ñ, ø, Š, etc.) • Different systems, many different code pages • ISO Latin 1, CP1252: Western languages (197 = Å) • ISO Latin 2, CP1250: East Europe (197 = Ĺ) • Code page is an interpretation, not a property of text • Swedish programmer would have to write ä aÄiÜ='Ön'; ü instead of { a[i]='\n'; } • If text is interpreted with the wrong code page, it is not displayed as intended
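The "197" example can be checked directly: the same byte decodes to different characters depending on the code page chosen. A small sketch using Python's built-in codecs:

```python
raw = bytes([197])   # one byte, value 197 (0xC5)

# The same byte under the four code pages named on the slide.
decoded = {name: raw.decode(name)
           for name in ("latin-1", "cp1252", "iso8859-2", "cp1250")}

for name, char in decoded.items():
    print(name, char)
```

The byte itself carries no hint about which interpretation is intended, which is why the code page has to be known from outside the text.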

UNICODE: an unambiguous code • We need a solution that can be interpreted unambiguously, i.e. one in which each code corresponds to exactly one character and vice versa • That’s why UNICODE was created! • Examples: $ = U+0024, Å = U+00C5, Ĺ = U+0139, Æ = U+00C6, ή = U+03AE, ك = U+0643, ⅝ = U+215D, ♥ = U+2665, Ж = U+0416, ญ = U+0E0D

UNICODE • ISO standard (ISO/IEC 10646) • About 100,000 characters, space for 1,000,000 • Unique code points from U+0000 through U+FFFF to U+10FFFF • Well-defined process for adding characters • When dealing with any text, simply use UNICODE • Character code charts: http://www.unicode.org/charts/ • See also: http://www.tbray.org/talks/rubyconf2006.pdf and http://tbray.org/ongoing/When/200x/2003/04/06/Unicode
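Code points are independent of any byte encoding; an encoding such as UTF-8 is applied only when text is stored or transmitted. A short sketch (the helper name `code_point` is an assumption):

```python
def code_point(char):
    """Format a character's code point in the usual U+XXXX notation."""
    return f"U+{ord(char):04X}"


for char in "$\u00c5\u0139\u2665":           # $, Å, Ĺ, ♥
    print(char, code_point(char), char.encode("utf-8").hex())

heart = chr(0x2665)                           # from code point back to character
last = chr(0x10FFFF)                          # highest code point that exists
```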

URI: UNIFORM RESOURCE IDENTIFIERS How to identify things on the Web

Identifier, Resource, Representation Taken from http://www.w3.org/TR/webarch/

URI, URN, URL • A Uniform Resource Identifier (URI) is a string of characters used to identify a name or a resource on the Internet • A URI can be a URL or a URN • A Uniform Resource Name (URN) defines an item's identity • the URN urn:isbn:0-395-36341-1 is a URI that specifies the identifier system, i.e. International Standard Book Number (ISBN), as well as the unique reference within that system and allows one to talk about a book, but doesn't suggest where and how to obtain an actual copy of it • A Uniform Resource Locator (URL) provides a method for finding it • the URL http://www.auckland.ac.nz identifies a resource (UoA's home page) and implies that a representation of that resource (such as the home page's current HTML code, as encoded characters) is obtainable via HTTP from a network host named www.auckland.ac.nz

URI Syntax • Examples • http://www.ietf.org/rfc/rfc3986.txt • mailto:John.Doe@example.com • news:comp.infosystems.www.servers.unix • telnet://melvyl.ucop.edu/ • URI Syntax scheme: [//authority] [/path] [?query] [#fragid] • The scheme distinguishes different kinds of URIs • Authority normally identifies a server • Path normally identifies a directory and a file • Query adds extra parameters • Fragment ID identifies a secondary resource
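The generic syntax can be explored with Python's standard `urllib.parse` module, which splits a URI into the same five components. The wrapper `components` and the `www.example.org` URI are assumptions for illustration; note that the library calls the authority `netloc`.

```python
from urllib.parse import urlsplit


def components(uri):
    """Split a URI into the five parts of the generic syntax."""
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,
        "authority": parts.netloc,
        "path": parts.path,
        "query": parts.query,
        "fragid": parts.fragment,
    }


rfc = components("http://www.ietf.org/rfc/rfc3986.txt")
mail = components("mailto:John.Doe@example.com")
full = components("http://www.example.org/search?q=abba#results")
print(rfc)
print(mail)
print(full)
```

The `mailto:` example has no `//authority` part, so everything after the scheme ends up in the path.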

URI Syntax cont’d • Reserved characters (like /:?#@$&+* ) • Many allowed characters • All other characters are percent-encoded using their UTF-8 bytes • http://google.com/search?q=technikerstra%C3%9Fe • IRI: Internationalized Resource Identifier • Allows whole UNICODE • Specifies transformation into URI, mostly UTF-8 encoding
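The percent-encoding in the search example can be reproduced with the standard library: ß is encoded in UTF-8 as the two bytes C3 9F, and each byte becomes one `%XX` escape.

```python
from urllib.parse import quote, unquote

word = "technikerstra\u00dfe"        # "technikerstraße"

encoded = quote(word)                # UTF-8 bytes, then %XX per non-ASCII byte
decoded = unquote(encoded)           # reverses the transformation

print(encoded)
print(decoded)
```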

URI Schemes • Schemes partition the URI space into subspaces • Schemes can add or clarify properties of resources • Ownership (how authorities are formed) • Persistence (how stable the URIs should be) • Protocol (default access protocol) From http://www.iana.org/assignments/uri-schemes.html

How to exchange structured data on the Web XML: EXTENSIBLE MARKUP LANGUAGE

eXtensible Markup Language • Language for creating languages • “Meta-language” • XHTML is a language: HTML expressed in XML • W3C Recommendation (standard) • XML is, for the information industry, what the container is for international shipping • For structured and semistructured data • Main plus: wide support, interoperability • Platform-independent • Applying new tools to old data

Structure of XML Documents • Elements, attributes, content • One root element in document • Characters, child elements in content

XML Element • Syntax <name>contents</name> • <name> is called the opening tag • </name> is called the closing tag • Examples <gender>Female</gender> <story>Once upon a time there was…. </story> • Element names are case-sensitive

Attributes to XML Elements • Name/value pairs, written inside the element's opening tag • Syntax <name attribute_name="attribute_value">contents</name> • Values surrounded by single or double quotes • Example <temperature unit="F">64</temperature> <swearword language='fr'>con</swearword>

Empty Elements • Empty element: <name></name> • This can be shortened: <name/> • Empty elements may have attributes • Example <grade value='A'/>
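Elements, attributes and empty elements can be inspected with Python's standard `xml.etree.ElementTree` parser. In this sketch the root element `report` and the element `note` are hypothetical names added only to obtain a well-formed document with a single root.

```python
import xml.etree.ElementTree as ET

doc = """<report>
  <temperature unit="F">64</temperature>
  <grade value='A'/>
  <note></note>
</report>"""

root = ET.fromstring(doc)                 # exactly one root element

temperature = root.find("temperature")
print(temperature.tag, temperature.attrib, temperature.text)

grade = root.find("grade")                # empty element with an attribute
print(grade.get("value"), grade.text)

note = root.find("note")                  # long form of an empty element
print(note.text)

# Element names are case-sensitive: this lookup finds nothing.
print(root.find("Temperature"))
```

Both spellings of an empty element parse to the same thing: an element without text content.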
