Characterizing Knowledge on the Semantic Web with Watson. Mathieu d’Aquin, Claudio Baldassarre, Laurian Gridinoc, Sofia Angeletou, Marta Sabou, Enrico Motta The Knowledge Media Institute, The Open University m.daquin@open.ac.uk
The Semantic Web is Growing • Source: Lee, J., Goodwin, R. (2004) The Semantic Webscape: a View of the Semantic Web. IBM Research Report.
The Semantic Web is growing… http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData
Next Generation Semantic Web applications • Need for a Gateway to the Semantic Web • Exploiting the Semantic Web rather than engineering their own knowledge/ontologies
More on Watson? See also… • Watson Web Interface: http://watson.kmi.open.ac.uk • Watson poster and demo at ISWC 2007…
Characterizing Knowledge in Watson? • Besides being a gateway for applications, Watson gives the opportunity, through an analysis of its repository, to better understand: • How semantic technologies are used to publish knowledge online • How knowledge is structured on the Semantic Web • How ontologies and semantic documents are interconnected in a semantic network • Such an analysis provides valuable information for application and tool developers concerning the knowledge they have to manipulate.
The Watson Collection • Collecting Semantic Content: • A number of specialized crawlers for Google, ontology repositories (e.g. Swoogle), PingTheSemanticWeb, etc. • Validated by parsing with Jena, to get only RDF documents • Filters: • Before filtering, the repository was composed almost entirely of RSS and FOAF (more than 5 times the number of other documents) • Therefore, the analysis would have been more an analysis of RSS and FOAF than anything else. • These have been filtered out. • An analysis of the FOAF part of the repository separately would be interesting.
The Watson Collection Result: almost 25,500 semantic documents
The Watson Collection • In order to index these documents, Watson extracts information about them. • Information about the content: classes, properties and individuals, the relations between them, the coverage in terms of domain topics, etc. • Information about the representation: the language used and its expressivity, the size and structure of the document, etc. • Information about the network aspects of semantic documents: identification, links between documents, etc. • It is these elements of information that we intend to analyse. • Note that all these elements of information are freely available through the Watson API.
In the Following • Measures on the following aspects: • Usage of semantic technologies to publish knowledge on the Web • Structure and coverage of semantic documents • The knowledge network • Focusing more on the most “debatable” elements.
Semantic Web languages… • Here a document is considered to be in a given language if it instantiates an entity of that language • e.g. it is in both OWL and RDF-S if it contains an OWL property and an RDF-S class • The majority is factual data in RDF • OWL adopted as ontology language • Less overlap between OWL and RDF-S than between DAML+OIL and RDF-S: • better separation of the meta-models in OWL
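The classification rule on this slide can be sketched in a few lines. This is only an illustration, not Watson's implementation: it assumes a document is already available as a list of (subject, predicate, object) URI strings, and that "instantiating an entity of the language" means having an rdf:type whose value lies in that language's namespace. The function name `languages_used` and the example.org URIs are hypothetical.

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Namespaces of the vocabularies discussed on the slide.
LANGUAGE_NAMESPACES = {
    "OWL": "http://www.w3.org/2002/07/owl#",
    "RDF-S": "http://www.w3.org/2000/01/rdf-schema#",
    "DAML+OIL": "http://www.daml.org/2001/03/daml+oil#",
}


def languages_used(triples: Iterable[Triple]) -> Set[str]:
    """Return the languages whose entities are instantiated through rdf:type."""
    found = set()
    for _subject, predicate, obj in triples:
        if predicate != RDF_TYPE:
            continue
        for name, namespace in LANGUAGE_NAMESPACES.items():
            if obj.startswith(namespace):
                found.add(name)
    return found


# Hypothetical document: one OWL property and one RDF-S class.
example = [
    ("http://example.org/hasPart", RDF_TYPE,
     "http://www.w3.org/2002/07/owl#ObjectProperty"),
    ("http://example.org/Car", RDF_TYPE,
     "http://www.w3.org/2000/01/rdf-schema#Class"),
]
print(sorted(languages_used(example)))  # ['OWL', 'RDF-S']
```

A document for which the function returns an empty set would fall under plain factual RDF in this sketch.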
… and their expressivity • Apparent contradiction: • Most of the documents are in OWL Full • But 95% use only a very restricted part of the expressive power of OWL (below OWL Lite) • Explanation: documents end up in OWL Full because of simple syntactic mistakes
Size of the documents • As with expressivity, a power law distribution: lots of very small documents and a few very large ones (both for ontological knowledge and factual data, but on different scales) • [Plots: number of classes per document; number of instances per document]
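A rough way to check for this kind of distribution is to count how many documents have each size and fit a line on log-log axes. The sketch below is an assumption about how such a check could be done, not the method used in the study; the function names and the sample sizes are made up for illustration, and a least-squares fit on log-log data is only a quick diagnostic, not a rigorous power-law test.

```python
import math
from collections import Counter
from typing import Iterable, List, Tuple


def size_histogram(sizes: Iterable[int]) -> List[Tuple[int, int]]:
    """Count how many documents have each size (e.g. number of classes)."""
    return sorted(Counter(s for s in sizes if s > 0).items())


def log_log_slope(points: List[Tuple[int, int]]) -> float:
    """Least-squares slope of log(document count) against log(size)."""
    xs = [math.log(size) for size, _count in points]
    ys = [math.log(count) for _size, count in points]
    n = len(points)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


# Synthetic sample: many tiny documents, very few large ones.
sizes = [1] * 64 + [2] * 16 + [4] * 4 + [8] * 1
points = size_histogram(sizes)
slope = log_log_slope(points)
print(points)           # [(1, 64), (2, 16), (4, 4), (8, 1)]
print(round(slope, 2))  # close to -2 for this synthetic sample
```

A clearly negative, roughly constant slope is what "lots of very small documents and a few very large ones" looks like in this representation.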
Density of the representation • On average, classes are: • Poorly defined (small number of properties and super-classes per class) • Highly instantiated (high number of instances per class) • Even the best represented class in each ontology has only 1 property on average.
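The density measures named on this slide could be computed along the following lines. This is a minimal sketch under stated assumptions: a document is a list of (subject, predicate, object) URI strings, a property is attached to a class through rdfs:domain, and super-classes come from rdfs:subClassOf. The slides do not say how Watson counts properties per class, so the rdfs:domain choice, the function name `class_density` and the example.org data are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
OWL_CLASS = "http://www.w3.org/2002/07/owl#Class"
CLASS_TYPES = {RDFS + "Class", OWL_CLASS}


def class_density(triples: List[Triple]) -> Optional[Dict[str, float]]:
    """Average properties, super-classes and instances per declared class."""
    classes = {s for s, p, o in triples if p == RDF_TYPE and o in CLASS_TYPES}
    if not classes:
        return None  # purely factual document: no class to measure

    properties = defaultdict(set)
    superclasses = defaultdict(set)
    instances = defaultdict(set)
    for s, p, o in triples:
        if p == RDFS + "domain" and o in classes:
            properties[o].add(s)      # assumption: domain attaches a property
        elif p == RDFS + "subClassOf" and s in classes:
            superclasses[s].add(o)
        elif p == RDF_TYPE and o in classes:
            instances[o].add(s)

    def average(per_class: Dict[str, set]) -> float:
        return sum(len(values) for values in per_class.values()) / len(classes)

    return {
        "properties_per_class": average(properties),
        "superclasses_per_class": average(superclasses),
        "instances_per_class": average(instances),
    }


EX = "http://example.org/"
example = [
    (EX + "Person", RDF_TYPE, OWL_CLASS),
    (EX + "Student", RDF_TYPE, OWL_CLASS),
    (EX + "Student", RDFS + "subClassOf", EX + "Person"),
    (EX + "name", RDFS + "domain", EX + "Person"),
    (EX + "alice", RDF_TYPE, EX + "Student"),
    (EX + "bob", RDF_TYPE, EX + "Student"),
    (EX + "carol", RDF_TYPE, EX + "Person"),
]
print(class_density(example))
```

For the toy document above, the two classes share one property and one super-class link but three instances, which mirrors the "poorly defined, highly instantiated" pattern in miniature.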
Topic Domain Coverage • Level of coverage of ontologies for the top categories in DMOZ (details in the paper) • Very heterogeneous distribution • Not well correlated with the topic distribution of the Web
Identification of semantic documents • Contributes to the networked and distributed aspects of the Semantic Web • URIs are unique identifiers, but when applied to ontologies, they may be duplicated: • Default URI of the ontology editor (Protégé) • Misuse of the URI of existing vocabularies (OWL) • Different versions of an ontology having the same URI • Also, it is good practice for URIs to be dereferenceable, but only 30% of the semantic documents can be reached through their URI.
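Spotting duplicated ontology URIs amounts to grouping documents by the URI they declare and keeping the groups with more than one location. The sketch below assumes each document is described by a (location, declared URI) pair; the function name `shared_uris` and all example.org URIs are hypothetical, and checking dereferenceability is left out because it needs network access.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def shared_uris(documents: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each ontology URI declared by several documents to their locations."""
    locations_by_uri = defaultdict(set)
    for location, declared_uri in documents:
        locations_by_uri[declared_uri].add(location)
    return {
        uri: sorted(locations)
        for uri, locations in locations_by_uri.items()
        if len(locations) > 1
    }


# Hypothetical crawl result: two unrelated files kept an editor's default URI.
documents = [
    ("http://example.org/files/pizza.owl", "http://example.org/default.owl"),
    ("http://example.org/files/wine.owl", "http://example.org/default.owl"),
    ("http://example.org/files/travel.owl", "http://example.org/travel"),
]
print(shared_uris(documents))
```

The same grouping covers all three cases listed on the slide, since each of them shows up as one URI declared at several locations.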
Connectedness and Redundancy • Connectedness and redundancy are both important aspects of distributed systems. • Connectedness: • A few large providers (W3.org, Stanford) and a few locally dense networks (Ontoworld) • Otherwise, very local ontologies • Redundancy: • Almost 30% of the semantic documents are duplicates • 12% of the entities are described more than once • Better support for the network aspects of ontologies is required.
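The two redundancy measures can be illustrated with a small sketch. It assumes a corpus is a mapping from document location to a list of (subject, predicate, object) strings, treats two documents as duplicates when they contain exactly the same set of triples, and treats an entity as described in a document when it appears there as a subject. These definitions, the function names and the example.org data are assumptions for illustration; blank nodes are ignored, and the slides do not state how Watson detects duplicates.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]
Corpus = Dict[str, List[Triple]]


def fingerprint(triples: List[Triple]) -> str:
    """Order-independent hash of a document's set of triples."""
    canonical = repr(sorted(set(triples)))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def duplicate_groups(corpus: Corpus) -> List[List[str]]:
    """Groups of locations whose documents contain the same triples."""
    groups = defaultdict(list)
    for location, triples in corpus.items():
        groups[fingerprint(triples)].append(location)
    return sorted(sorted(group) for group in groups.values() if len(group) > 1)


def redundant_entity_share(corpus: Corpus) -> float:
    """Fraction of entities that appear as a subject in more than one document."""
    described_in = defaultdict(set)
    for location, triples in corpus.items():
        for subject, _predicate, _obj in triples:
            described_in[subject].add(location)
    if not described_in:
        return 0.0
    repeated = sum(1 for docs in described_in.values() if len(docs) > 1)
    return repeated / len(described_in)


EX = "http://example.org/"
corpus = {
    EX + "a.owl": [(EX + "Car", EX + "type", EX + "Class")],
    EX + "copy-of-a.owl": [(EX + "Car", EX + "type", EX + "Class")],
    EX + "b.owl": [
        (EX + "Car", EX + "label", "Car"),
        (EX + "Bike", EX + "type", EX + "Class"),
    ],
}
print(duplicate_groups(corpus))
print(redundant_entity_share(corpus))  # 0.5
```

In practice the entity measure would probably be computed after removing exact duplicates, so that copied files do not inflate it; whether the reported 12% was computed that way is not stated in the slides.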
Conclusion • Our analysis allows us to draw some conclusions about the characteristics of the knowledge published online. • In particular, it shows that • Semantic Web documents tend to be small, lightweight and weakly structured • Efforts are still required to publish knowledge in a variety of domains • The network aspects are not sufficiently taken into consideration in semantic technologies • These constitute valuable information for tool and application developers.
Limitations • This work can be seen as a first step towards a fine-grained characterization of the Semantic Web. • But in its current state, it suffers from a number of limitations: • Only a sample of the Semantic Web • A snapshot of the current dataset; its evolution should also be considered • Simple analysis methods. Would data mining approaches be relevant? • The analyzed aspects are insufficient to fully capture the quality of the knowledge available online
A last word… • We believe that the field of evaluation of ontologies and ontology-based tools could provide valuable inputs to this study, so please comment, suggest, question… • Watson is an open system; our data is available through the Watson API. http://watson.kmi.open.ac.uk m.daquin@open.ac.uk