From data source to Guided Exploration: a tool stack for Semantic Web navigation
May 2006
Aduna, Jeen Broekstra
jeen.broekstra@aduna-software.com
Knowledge Systems Course
Time table
• [8:45 – 9:00] Introduction
  • Aduna
  • RDF and the Semantic Web
• [9:00 – 9:30] Software stack: Middleware
  • Sesame: storage and querying for RDF
  • Aperture: retrieving metadata from data
• [9:45 – 10:15] Software stack: Presentation
  • Spectacle: RDF-based Facet Navigation
  • AutoFocus: Cluster Visualization
• [10:15 – 10:30] Demo + discussion
About Aduna
• Where are we:
  • Amersfoort, the Netherlands
• What do we do:
  • Develop software for effective navigation and visualization of large information sources
  • Use Semantic Web technology to enable better search
Aduna and Software
• Software Components:
  • Aperture: a framework for extracting metadata from various kinds of sources (e.g. Word files, e-mail, PDF, images, …)
  • Sesame: a toolkit/database for scalable storage and querying of RDF, RDFS and OWL
  • Spectacle: efficient facet navigation
  • Cluster Map: visualization component
RDF in one slide
• Data model for expressing knowledge
• basic building block: statement
    <person001> <name> "Jeen" .
• groups of statements form graphs
[Figure: example RDF graph with nodes person001, project001, Jeen, j.broekstra@tue.nl and Sesame, and edge labels name, email, worksIn, projectMemberEmail]
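To make the statement/graph idea concrete, here is a minimal Python sketch that treats a statement as a (subject, predicate, object) tuple and a graph as a set of such tuples. Only the first statement is taken literally from the slide; the other three edges are an assumed reading of the figure labels, and the `match` helper is a hypothetical illustration, not part of any Aduna tool.

```python
# A statement is a (subject, predicate, object) tuple; a graph is a set of them.
# Only the first statement is given literally on the slide; the rest are an
# assumed reading of the figure.
graph = {
    ("person001", "name", "Jeen"),
    ("person001", "email", "j.broekstra@tue.nl"),
    ("person001", "worksIn", "project001"),
    ("project001", "name", "Sesame"),
}


def match(graph, s=None, p=None, o=None):
    """Return the statements that agree with every non-None position."""
    return sorted(
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


names = match(graph, p="name")               # every statement using 'name'
about_person = match(graph, s="person001")   # everything said about person001
```

Because a graph is just a set of statements, merging two graphs is a plain set union.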
RDF Schema in one more slide
• RDF Schema is a Vocabulary Description Language
  • it allows specification of domain vocabulary and a way to structure it
  • Class, Property, subClassOf, subPropertyOf, domain, range
• Formal semantics add simple reasoning capabilities:
  • class and property subsumption
  • domain and range inference
[Figure: example schema graph with nodes rdfs:Class, rdf:Property, Person, Researcher, name and person001, and edge labels rdf:type, rdfs:domain, rdfs:subClassOf]
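The two kinds of inference named on this slide can be sketched in a few lines of Python. This is a simplified illustration covering only class subsumption and domain inference, not the full RDF Schema semantics and not Sesame's reasoner. The schema edges (Researcher as a subclass of Person, name with domain Person) are an assumed reading of the figure, and person002 is added for illustration.

```python
RDF_TYPE, SUBCLASS, DOMAIN = "rdf:type", "rdfs:subClassOf", "rdfs:domain"


def rdfs_closure(graph):
    """Apply three RDFS-style rules until no new statements appear."""
    inferred = set(graph)
    while True:
        new = set()
        for s, p, o in inferred:
            if p == SUBCLASS:
                # subClassOf is transitive
                new |= {(s, SUBCLASS, o2) for s2, p2, o2 in inferred
                        if p2 == SUBCLASS and s2 == o}
                # instances of a subclass are instances of the superclass
                new |= {(x, RDF_TYPE, o) for x, p2, c in inferred
                        if p2 == RDF_TYPE and c == s}
            elif p == DOMAIN:
                # the subject of a property is an instance of its domain
                new |= {(x, RDF_TYPE, o) for x, p2, _ in inferred if p2 == s}
        if new <= inferred:
            return inferred
        inferred |= new


# Assumed example data, loosely following the figure.
graph = {
    ("Researcher", SUBCLASS, "Person"),
    ("name", DOMAIN, "Person"),
    ("person001", RDF_TYPE, "Researcher"),
    ("person002", "name", "Lora"),
}
closure = rdfs_closure(graph)
```

In this sketch person001 becomes a Person through class subsumption, while person002 becomes a Person only because it has a name, which is the domain rule at work.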
The tool stack
• presentation
• middleware
  • Sesame: metadata storage and reasoning
  • Aperture: metadata extraction
Aperture
What is Aperture?
• Aperture is a Java framework for extracting and querying full-text content and metadata from various information systems (e.g. file systems, web sites, mail boxes) and the file formats (e.g. documents, images) occurring in these systems.
• Open Source project by Aduna and DFKI: http://aperture.sourceforge.net/
Aperture Features
• Crawl information systems such as file systems, websites, mail boxes and mail servers
• Extract full-text and metadata from many common file formats
• View files in their native applications
• Ease of use: easy to learn, easy to code, easy to deploy in industrial projects
• Flexible architecture: can be extended with custom file formats, data sources, etc., with support for deployment on OSGi platforms
• Data exchange based on Semantic Web standards (e.g. RDF, SPARQL, ...)
Supported File Formats
• Plain text
• HTML, XHTML
• XML
• PDF (Portable Document Format)
• RTF (Rich Text Format)
• Microsoft Office: Word, Excel, PowerPoint, Visio, Publisher
• Microsoft Works
• OpenOffice 1.x: Writer, Calc, Impress, Draw
• StarOffice 6.x - 7.x+: Writer, Calc, Impress, Draw
• OpenDocument (OpenOffice 2.x, StarOffice 8.x)
• Corel WordPerfect, Quattro, Presentations
• Emails (.eml files)
The Sesame Framework
What is Sesame?
• A framework for storage, querying and inferencing of RDF and RDF Schema
• A Java Library for handling RDF
• A Database Server for (remote) access to repositories of RDF data
• Open Source project by Aduna: http://www.openRDF.org/
Sesame features
• Light-weight yet powerful Java API
• Highly expressive query and transformation languages
  • SeRQL, SPARQL
• High scalability (O(10^7) RDF triples on desktop hardware)
• Various backends
  • Native Store
  • RDBMS (MySQL, Oracle 10, DB2, PostgreSQL)
  • main memory
• Reasoning support
  • RDF Schema reasoner
  • OWL DLP (OWLIM)
  • domain reasoning (custom rule engine)
• Rio Toolkit: parsers and writers for different RDF syntaxes
  • RDF/XML, Turtle, N3, N-Triples, TriX
Sesame 2 architecture
[Figure: layered architecture diagram with the labels application, HTTP / SPARQL protocol, HTTP Server, Repository Access API, SeRQL, SPARQL, SAIL API, Rio, SAIL Query Model, RDF Model]
Sesame 2 architecture
• application
  • Remote apps can communicate over the Web (HTTP / SPARQL protocol) with a Sesame server and update data or do queries.
  • Local apps can just include (parts of) Sesame as a Java library and use it to process RDF data efficiently.
• HTTP Server: allows deployment of Sesame as a web-enabled database server (e.g. in Tomcat). Implements a superset of the SPARQL protocol (HTTP REST).
• Repository Access API: main access API of Sesame. Offers developer-friendly methods for manipulating RDF data (querying, adding, removing, updating).
• SeRQL, SPARQL: declarative querying and other 'higher-level' functions on SAILs.
• SAIL API / SAIL Query Model: Storage And Inference Layer. System API for 'wrapping' storage backends.
• Rio: RDF I/O. Set of parsers and writers for RDF/XML, Turtle, N3, N-Triples. Can be used separately.
• RDF Model: the core RDF model, containing objects and interfaces for URIs, blank nodes, literals, statements.
The SAIL API
• Storage And Inferencing Layer
• Abstraction from physical storage
  • allows other Sesame components to function on any type of store
  • can be used as a wrapper layer for a particular data source
• System Internal API
  • application developers typically do not use it directly
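The idea of abstracting from physical storage can be illustrated with a small Python sketch. The actual SAIL API is a Java API whose interfaces are not shown in these slides, so every name below (`Store`, `MemoryStore`, `RecordStore`, `subjects_with`) is hypothetical and only demonstrates the design principle.

```python
from abc import ABC, abstractmethod


class Store(ABC):
    """Hypothetical storage interface; not the real SAIL API."""

    @abstractmethod
    def statements(self):
        """Yield (subject, predicate, object) tuples."""


class MemoryStore(Store):
    """Keeps statements in main memory."""

    def __init__(self, triples):
        self._triples = set(triples)

    def statements(self):
        yield from self._triples


class RecordStore(Store):
    """Wraps an existing data source (here a dict of records) as statements."""

    def __init__(self, records):
        self._records = records

    def statements(self):
        for subject, fields in self._records.items():
            for predicate, value in fields.items():
                yield (subject, predicate, value)


def subjects_with(store, predicate):
    """A component written against Store works on any backend."""
    return sorted({s for s, p, _ in store.statements() if p == predicate})


memory = MemoryStore([("person001", "name", "Jeen")])
wrapped = RecordStore({"person002": {"name": "Lora"}})
```

The function `subjects_with` never needs to know which backend it is talking to, which is the property the slide attributes to the SAIL layer.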
The Repository Access API
• A single Java object representation for a Sesame database, offering methods for
  • evaluating a query and retrieving the result
  • adding RDF data from local file, from the web, as a text string, etc.
  • adding/removing (sets of) RDF statements
  • starting/stopping transactions
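As a rough illustration of what a single repository object with these responsibilities could look like, here is a toy in-memory version in Python. It is not Sesame's Java API: the class and method names are invented for this sketch, querying is reduced to simple pattern matching, and loading from files or the web is left out.

```python
class Repository:
    """Toy in-memory repository; names are hypothetical, not Sesame's API."""

    def __init__(self):
        self._statements = set()
        self._pending = None  # working copy while a transaction is open

    def begin(self):
        self._pending = set(self._statements)

    def commit(self):
        if self._pending is None:
            raise RuntimeError("no transaction in progress")
        self._statements, self._pending = self._pending, None

    def rollback(self):
        self._pending = None

    def _target(self):
        return self._statements if self._pending is None else self._pending

    def add(self, *statements):
        self._target().update(statements)

    def remove(self, *statements):
        self._target().difference_update(statements)

    def query(self, s=None, p=None, o=None):
        """Return statements matching every non-None position."""
        return sorted(t for t in self._target()
                      if s in (None, t[0]) and p in (None, t[1]) and o in (None, t[2]))


repo = Repository()
repo.add(("person001", "name", "Jeen"))

repo.begin()
repo.add(("person002", "name", "Lora"))
repo.rollback()                      # the addition is discarded
after_rollback = repo.query()

repo.begin()
repo.remove(("person001", "name", "Jeen"))
repo.commit()                        # the removal becomes permanent
after_commit = repo.query()
```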
Querying RDF
• RDF is a labeled, directed graph of semistructured data
  • no rigid schema
• An RDF query language needs to be able to address this:
  • graph path expressions
  • dealing with semistructured nature of RDF
  • flexible querying of both data and schema
SeRQL
• Language proposal based on best practices
  • Redesign of RQL to make it easier to use, incorporating ideas from many other query languages
  • Developed in the Sesame project
• Expressive language, but still fairly easy to use
• Support for RDF Schema
• Implementation: Sesame
SeRQL path expressions
[Figure: Netherlands -[hasCapital]-> Amsterdam -[areacode]-> 020]
• {X} geo:hasCapital {geo:Amsterdam}
• {X} geo:hasCapital {Y}
• {X} P {Y}
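The three path expressions differ only in how many positions are variables. A small Python sketch of the matching idea follows; it is an illustration only, not how Sesame evaluates SeRQL. The leading "?" used to mark variables is a convention of this sketch, and namespace prefixes are dropped for brevity.

```python
graph = {
    ("Netherlands", "hasCapital", "Amsterdam"),
    ("Amsterdam", "areacode", "020"),
}


def is_var(term):
    # Convention for this sketch only: a leading "?" marks a variable.
    return term.startswith("?")


def match_pattern(graph, pattern):
    """Return one binding dict per statement that fits the pattern."""
    results = []
    for triple in sorted(graph):
        binding = {}
        for term, value in zip(pattern, triple):
            if is_var(term):
                # a repeated variable must bind to the same value
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            results.append(binding)
    return results


# {X} geo:hasCapital {geo:Amsterdam}
who_has_amsterdam = match_pattern(graph, ("?X", "hasCapital", "Amsterdam"))
# {X} geo:hasCapital {Y}
all_capitals = match_pattern(graph, ("?X", "hasCapital", "?Y"))
# {X} P {Y}
everything = match_pattern(graph, ("?X", "?P", "?Y"))
```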
Chaining, branching and comparing
[Figure: Netherlands -[hasCapital]-> Amsterdam -[areacode]-> 020]
• Chaining:
  • {X} geo:hasCapital {Y} geo:areacode {Z}
• Branching:
  • {X} rdf:type {Y}; geo:areacode {Z}
• Comparison operators:
  • String comparison:
    • X like "*Netherlands"
    • Y like "A*"
  • boolean comparison:
    • X < Y, X <= Y, Z < 20, Z = Y, etc.
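Chaining and branching are shorthand for several single-step patterns. The Python sketch below shows one way to expand them, and approximates `like` by treating "*" as the only wildcard. The helper names are hypothetical, and the `like` semantics here are a simplification rather than a statement of SeRQL's exact rules.

```python
import re


def chain(subject, *steps):
    """{X} p1 {Y} p2 {Z}: each object is the subject of the next step."""
    patterns = []
    for predicate, obj in steps:
        patterns.append((subject, predicate, obj))
        subject = obj
    return patterns


def branch(subject, *steps):
    """{X} p1 {Y}; p2 {Z}: every step shares the same subject."""
    return [(subject, predicate, obj) for predicate, obj in steps]


def like(value, pattern):
    """Approximate 'like': '*' matches any run of characters."""
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None


chained = chain("X", ("geo:hasCapital", "Y"), ("geo:areacode", "Z"))
branched = branch("X", ("rdf:type", "Y"), ("geo:areacode", "Z"))
```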
SeRQL query composition
• Using the building blocks, we can compose complex queries.
• SeRQL uses a select-from-where syntax

    SELECT X, Y
    FROM {X} geo:hasCapital {Y} geo:areacode {Z}
    WHERE Z like "020"
    USING NAMESPACE
      geo = <http://www.geography.org/schema.rdf#>
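The select-from-where structure maps onto three steps: join the path expression's patterns, filter the bindings, and project the selected variables. A minimal Python sketch of that pipeline is given below. It is a naive nested-loop evaluation for illustration, not Sesame's query engine; the Utopia/Nowhere statements are made-up filler data, and `like "020"` is simplified to an equality test since the pattern has no wildcard.

```python
graph = {
    ("Netherlands", "hasCapital", "Amsterdam"),
    ("Amsterdam", "areacode", "020"),
    # made-up filler data so the WHERE clause has something to reject
    ("Utopia", "hasCapital", "Nowhere"),
    ("Nowhere", "areacode", "000"),
}


def solve(graph, patterns, binding=None):
    """Yield every variable binding that satisfies all patterns (a join)."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for triple in sorted(graph):
        candidate = dict(binding)
        for term, value in zip(first, triple):
            if term.startswith("?"):
                if candidate.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield from solve(graph, rest, candidate)


def select(graph, projection, patterns, where=lambda b: True):
    """FROM = patterns, WHERE = filter on bindings, SELECT = projection."""
    return [tuple(b[v] for v in projection)
            for b in solve(graph, patterns) if where(b)]


rows = select(
    graph,
    projection=("?X", "?Y"),
    patterns=[("?X", "hasCapital", "?Y"), ("?Y", "areacode", "?Z")],
    where=lambda b: b["?Z"] == "020",
)
```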
Optional path expressions
• RDF is semi-structured
• Even when the schema says some object should have a particular property, it may not always be present in the data:
  • Persons have names and email addresses, but Lora is a person without a known email address
[Figure: person001 and person002 both have type Person; person001 has name Jeen and email j.broekstra@tue.nl; person002 has name Lora and no email]
Optional path expressions (2)
• To be able to query for all persons, their first names, and if known their email address, SeRQL introduces optional path expressions:

    SELECT Person, Name, Email
    FROM {Person} my:name {Name};
         [my:email {Email}]
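An optional path expression behaves like a left outer join: the required part decides which rows exist, and the optional part only adds values where they are available. A minimal Python sketch over the Jeen/Lora data, with hypothetical helper names and `None` standing in for an unbound variable:

```python
graph = {
    ("person001", "name", "Jeen"),
    ("person001", "email", "j.broekstra@tue.nl"),
    ("person002", "name", "Lora"),
}


def objects(graph, subject, predicate):
    return sorted(o for s, p, o in graph if s == subject and p == predicate)


def persons_with_optional_email(graph):
    rows = []
    for person, predicate, name in sorted(graph):
        if predicate != "name":
            continue  # the required part: {Person} my:name {Name}
        # the optional part: keep the row even when no email exists
        emails = objects(graph, person, "email") or [None]
        rows.extend((person, name, email) for email in emails)
    return rows


rows = persons_with_optional_email(graph)
```

Without the optional brackets the email pattern would be required, and Lora would drop out of the result entirely.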
CONSTRUCT queries
• CONSTRUCT-queries return RDF statements
  • each RDF statement matching the query pattern is returned
• The query result is
  • a subgraph of the original graph, or
  • a transformed graph
• This mechanism also allows formulation of simple rules
SeRQL construct-queries
• Subgraph query:

    CONSTRUCT *
    FROM {X} geo:hasCapital {Y}

  [Figure: result graph Netherlands -[hasCapital]-> Amsterdam]
• Transformation query:

    CONSTRUCT {Y} my:inCountry {X}
    FROM {X} geo:hasCapital {Y}

  [Figure: result graph Amsterdam -[inCountry]-> Netherlands]
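Both kinds of CONSTRUCT result can be mimicked with set comprehensions over a graph of tuples. The Python sketch below is an illustration under the same simplified data model, with hypothetical function names and namespace prefixes dropped.

```python
graph = {
    ("Netherlands", "hasCapital", "Amsterdam"),
    ("Amsterdam", "areacode", "020"),
}


def construct_subgraph(graph, predicate):
    """CONSTRUCT * : return the matching statements unchanged."""
    return {t for t in graph if t[1] == predicate}


def construct_inverse(graph, predicate, new_predicate):
    """CONSTRUCT {Y} new {X} FROM {X} predicate {Y}."""
    return {(o, new_predicate, s) for s, p, o in graph if p == predicate}


subgraph = construct_subgraph(graph, "hasCapital")
transformed = construct_inverse(graph, "hasCapital", "inCountry")

# Used as a simple rule: add the derived statements back to the data.
enriched = graph | transformed
```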
SeRQL vs. SPARQL
• Both: expressive query and transformation language
  • SELECT and CONSTRUCT
  • optional path expressions
  • support for context/named graphs
• SeRQL ("circle")
  • nested queries, language tags, …
  • user-friendly syntax (but YMMV)
  • very efficient Sesame implementation
• SPARQL ("sparkle")
  • W3C Standard (in progress)
  • tool interoperability: Jena, Redland, 3Store, Sesame, …
SeRQL vs. SPARQL example
• SeRQL:

    SELECT X, Y
    FROM {X} geo:hasCapital {Y} geo:areacode {Z}
    WHERE Z like "020"
    USING NAMESPACE
      geo = <http://www.geography.org/schema.rdf#>

• SPARQL:

    PREFIX geo: <http://www.geography.org/schema.rdf#>
    SELECT ?x ?y
    WHERE {
      ?x geo:hasCapital ?y .
      ?y geo:areacode ?z .
      FILTER (?z = "020")
    }
Presentation
How to navigate ontology-based information
An ontology is not enough
• End users do not necessarily think in the same terms in which an ontology is modeled
• Search and navigation tools need to provide user-oriented access to the information
  • views
  • multiple access paths
  • recognizable options
  • quick results
Navigation problems 1
• Too many links or categories
  • overwhelming offer
• Deep hierarchies
  • information remains hidden
Examples
Navigation problems 2
• Query overspecification
  • zero results!
• Query underspecification
  • millions of hits!
Examples
Faceted navigation 1
• Facet = meta-data element
  • e.g. 'author', 'title', 'date', 'type'
• Facets have values
  • e.g. 'author is J. Brown'
• In collections facet values are related
  • e.g. author 'J. Brown' is connected to title 'Once upon a time ...'
• Faceted navigation = choose a facet value and see all related facets and values
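The core mechanic, choosing a facet value and then seeing which related values remain, can be sketched in Python as a filter followed by a count. This is an illustration of the concept and not Spectacle's implementation. Apart from 'J. Brown', 'Once upon a time ...' and the two document types, all values below are made up.

```python
from collections import Counter

# Example collection; each item is a dict from facet to value.
documents = [
    {"author": "J. Brown", "title": "Once upon a time ...", "type": "HTML Document"},
    {"author": "J. Brown", "title": "Hypothetical title B", "type": "XML Document"},
    {"author": "A. N. Other", "title": "Hypothetical title C", "type": "HTML Document"},
]


def select_items(items, **chosen):
    """Keep the items that carry every chosen facet value."""
    return [i for i in items if all(i.get(f) == v for f, v in chosen.items())]


def facet_counts(items):
    """For each facet, count how many items carry each value."""
    counts = {}
    for item in items:
        for facet, value in item.items():
            counts.setdefault(facet, Counter())[value] += 1
    return counts


all_counts = facet_counts(documents)
by_brown = select_items(documents, author="J. Brown")
remaining = facet_counts(by_brown)
```

Because only values that still occur in the current selection are counted, every option offered to the user leads to at least one result.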
Faceted navigation 2
• Problem solved
  • user has problems specifying query
  • over- and underspecification
• Solution
  • showing all options
  • give ways to drill down the information
• Applied
  • database selection (e.g. job sites), e-commerce (e.g. travel), enhancement of (full text) search
Example of faceted navigation
[Figure: screenshot with callouts]
• Facet: Type
• Facet values: Adobe AD, HTML Document, XML Document
• Nr. of instances per facet value
Facets are Data Views
• Each navigation facet is driven by a SeRQL query on the underlying Sesame repository
• SeRQL queries can retrieve and transform the data to provide a facet 'view'
• Spectacle uses the query results to populate the facet with values
Information visualization 1
• Types
  • Model visualization
  • Instance visualization
• Examples
  • Hyperbolic tree, InXight
  • Graph visualization, AquaBrowser
• Claim of visualization: show things that you can't (easily) express in words or lists
Information visualization 2
• Cluster Map = instance visualization
  • visualization of the search results
  • instances can be things like files, jobs, and people
• Map shows AND, OR and NOT of query arguments
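One way to read "AND, OR and NOT of query arguments" is that every instance is placed in the cluster for the exact combination of terms it matches. The Python sketch below computes such a grouping. It illustrates the idea only and makes no claim about how Aduna's Cluster Map is implemented; the search terms and file names are made up.

```python
def cluster_map(hits):
    """Group each instance by the exact set of query terms it matches."""
    terms_per_instance = {}
    for term, instances in hits.items():
        for instance in instances:
            terms_per_instance.setdefault(instance, set()).add(term)
    clusters = {}
    for instance, terms in terms_per_instance.items():
        clusters.setdefault(frozenset(terms), set()).add(instance)
    return clusters


# Made-up search results: term -> matching instances.
hits = {
    "rdf": {"a.pdf", "b.html", "c.doc"},
    "sesame": {"b.html", "c.doc", "d.eml"},
}
clusters = cluster_map(hits)

both = clusters[frozenset({"rdf", "sesame"})]      # rdf AND sesame
only_rdf = clusters[frozenset({"rdf"})]            # rdf AND NOT sesame
either = set().union(*clusters.values())           # rdf OR sesame
```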
Cluster Map examples
Aduna AutoFocus
AutoFocus helps you to explore data sources like files, websites and e-mail with Guided Exploration. AutoFocus scans data sources and automatically makes suggestions after you enter a search term. So if you are not completely sure what to look for, AutoFocus will help you with suggestions for refinement. In addition, you no longer have to store or search for information in complex directory hierarchies: AutoFocus will retrieve it anyway.
• Combined full text search in documents, websites and e-mail
• Relations shown in a Cluster Map
• Automatically generated suggestions help to refine the question
• Support for multiple data sources: documents, websites, e-mail boxes
Aduna Spectacle
Aduna Spectacle helps website visitors to find what they want with Guided Exploration. Aduna Spectacle supports faceted navigation. Users drill down step by step, making choices on multiple meta-data facets. Spectacle overcomes problems related to over- and underspecification. The user gets the right answer.
• Visitors find what they want without negative feedback like 'zero results'
• Navigation on multiple facets of information collections
• Use of information increases with faceted navigation
• Easy to implement on top of your existing information sources
Pointers
• Aduna: http://aduna-software.com/
• AutoFocus: http://aduna-software.com/products/autofocus/
• Spectacle: http://aduna-software.com/products/spectacle
• Sesame: http://www.openrdf.org/
• Aperture: http://aperture.sourceforge.net/
Demo & Discussion Time