Boca – features of an enterprise-ready Semantic Web storage system, part of the open-source IBM Semantic Layered Research Platform (http://ibm-slrp.sourceforge.net). Ben Szekely, April 2007.
RDF: Representing data as a graph • RDF models data’s content and meaning, rather than just its structure or serialization • RDF can more accurately represent the entities being modeled • Real objects, concepts, and processes often have a ragged shape • Can represent objects with complex structures directly without exposing implementation techniques • Data schemas do not need to be determined a priori
RDF: Example • RDF describes relationships as a directed graph with labeled nodes and edges. • Each statement is a (Subject, Predicate, Object) triple. • Figure: the node DavidJGrossman with three outgoing edges: name → “David J Grossman”, phone → “693-0120”, email → “djg@us.ibm.com”.
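As a minimal sketch (not Boca code), the graph in this example can be written as plain Python tuples, one per labeled edge. The subject URI is the one used elsewhere in the deck; the `http://example.org/` predicate URIs are placeholders, since the slide only shows the short labels name, phone, and email.

```python
# Each RDF statement is a (subject, predicate, object) triple.
# The example.org predicate URIs are illustrative placeholders, not from the slides.
PERSON = "http://www.ibm.com/people/DavidJGrossman"

triples = {
    (PERSON, "http://example.org/name", "David J Grossman"),
    (PERSON, "http://example.org/phone", "693-0120"),
    (PERSON, "http://example.org/email", "djg@us.ibm.com"),
}


def objects(graph, subject, predicate):
    """Follow the labeled edge `predicate` out of the node `subject`."""
    return {o for (s, p, o) in graph if s == subject and p == predicate}
```

Calling `objects(triples, PERSON, "http://example.org/phone")` follows the phone edge from the person node.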
RDF: Giving resources and relationships unique names • Name everything with URIs (Uniform Resource Identifiers) • Ensures that resources, attributes, relationships, and data types have unique names that can be widely shared • Delegates identifier creation down to smaller expert groups • URIs can often be dereferenced to find their defined meaning • URIs are often long enough to be human readable • Example: http://www.ibm.com/people/DavidJGrossman instead of DavidGrossman or Emp12345
Most examples of RDF triple stores focus on specific difficult problems • Focused on inference or standards • Preoccupied with “Billions of Triples” • Little thought given to application programming model • Not multi-user (limited security)
Boca Overview – Multi-user, distributed enterprise RDF repository • Selective RDF replication from server to client machines • Security, including named-graph-based RDF access control • Audit trails of changes to data within named graphs • Near real-time event notifications • Sophisticated programming model
Underlying Technologies • Relational Database (DB2, Oracle, MySQL) • RDF triples stored in a table (subject, predicate, object, named graph) • Space is saved by normalizing URIs and strings to integer ids • Extra tables for history, ACLs, replication • J2EE (Jetty, Tomcat, WebSphere) • Jetty: standalone server; check out from CVS and run for testing • WAS (WebSphere Application Server): enterprise-ready Web application server for real deployment • JMS Server (ActiveMQ, WebSphere MQ) • Pub-sub messaging used for real-time notifications of triple updates
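The storage layout described above (a four-column statement table plus integer ids for URIs and strings) might look roughly like the following sketch. It uses SQLite purely for illustration; the table and column names are assumptions and are not taken from Boca's actual schema.

```python
import sqlite3

# Hypothetical schema: table and column names are assumptions, not Boca's DDL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE node (id INTEGER PRIMARY KEY, value TEXT UNIQUE NOT NULL);
    CREATE TABLE statement (
        subject     INTEGER NOT NULL REFERENCES node(id),
        predicate   INTEGER NOT NULL REFERENCES node(id),
        object      INTEGER NOT NULL REFERENCES node(id),
        named_graph INTEGER NOT NULL REFERENCES node(id),
        UNIQUE (subject, predicate, object, named_graph)
    );
""")


def node_id(value):
    """Return the integer id for a URI or string, inserting it on first use."""
    conn.execute("INSERT OR IGNORE INTO node (value) VALUES (?)", (value,))
    row = conn.execute("SELECT id FROM node WHERE value = ?", (value,)).fetchone()
    return row[0]


def add_statement(s, p, o, g):
    """Store one statement as four integer ids instead of four long strings."""
    conn.execute(
        "INSERT OR IGNORE INTO statement VALUES (?, ?, ?, ?)",
        (node_id(s), node_id(p), node_id(o), node_id(g)),
    )
```

Because each distinct URI or string is stored once in `node`, a URI that appears in thousands of statements costs one text row plus an integer per use.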
Named Graphs • A named graph is the logical unit of RDF storage in Boca. • The named graph is the first-order unit of data access in the Boca programming model. • Each triple exists in exactly one named graph • The same S,P,O in two different graphs implies two separate statements. • Adding and removing triples is done in the context of a named graph • Each named graph has a metadata graph, containing information such as ACLs • Named graphs can be exposed via URLs, Web Services, LSIDs
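A toy in-memory model can make the named-graph rules concrete: every add or remove names a graph, and the same S, P, O in two graphs counts as two statements. The `QuadStore` class and its method names are hypothetical and do not reflect the Boca API.

```python
from collections import defaultdict


class QuadStore:
    """Toy store in which every triple lives in the context of a named graph."""

    def __init__(self):
        self.graphs = defaultdict(set)  # graph URI -> set of (s, p, o)

    def add(self, graph, s, p, o):
        self.graphs[graph].add((s, p, o))

    def remove(self, graph, s, p, o):
        # Removing in one graph leaves the same s, p, o in other graphs alone.
        self.graphs[graph].discard((s, p, o))

    def statements(self):
        """All statements as (graph, s, p, o) tuples."""
        return [(g, *t) for g, ts in self.graphs.items() for t in ts]
```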
Replication • Boca clients have a persistent local RDF store that mirrors a subset of the triples on the Boca server. • The replicated subset is specified by: • Triple patterns, e.g. (<http://tdwg.org/meetings/GUID-2#>, <http://tdwg.org/preds/hasParticipant>, *) • Named graph URIs • Triple patterns within named graphs • When a replication is initiated, the service computes what has changed in the subset based on pattern and graph subscriptions. • Replication can work as a background process on the client, or be explicitly initiated. • Applications can query/write against graphs in the local and server models.
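The subscription idea can be sketched as a filter over server-side statements: a statement is replicated if its graph is subscribed or if it matches a triple pattern, with `*` as a wildcard. This is only an illustration under those assumptions; the function names are hypothetical, and the real service also tracks what changed since the last replication, which is omitted here.

```python
WILDCARD = "*"


def matches(pattern, triple):
    """True if every pattern position is a wildcard or equals the triple's value."""
    return all(p == WILDCARD or p == t for p, t in zip(pattern, triple))


def replicated_subset(server_quads, patterns, graph_uris):
    """Select (graph, s, p, o) quads covered by a subscribed graph or triple pattern."""
    return {
        (g, s, p, o)
        for (g, s, p, o) in server_quads
        if g in graph_uris or any(matches(pat, (s, p, o)) for pat in patterns)
    }
```

With the pattern from the slide, `("http://tdwg.org/meetings/GUID-2#", "http://tdwg.org/preds/hasParticipant", "*")`, every participant statement for that meeting would be mirrored regardless of who the participant is.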
Notification – maintaining the replica in real-time • Updates to named graphs on server are published in near real-time to clients. • Local replicas can be kept up-to-date between replications. • Notification is central to distributed RDF applications • Ex: workflow, collaboration
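A rough in-process sketch of the notification flow follows: the server publishes the triples added to and removed from a named graph, and a subscribed replica patches itself between replications. Boca uses a JMS server for this; the `GraphUpdateBroker` and `Replica` classes below are hypothetical stand-ins, not JMS or Boca APIs.

```python
class GraphUpdateBroker:
    """In-process stand-in for a pub-sub topic per named graph URI."""

    def __init__(self):
        self.subscribers = {}  # graph URI -> list of callbacks

    def subscribe(self, graph_uri, callback):
        self.subscribers.setdefault(graph_uri, []).append(callback)

    def publish(self, graph_uri, added, removed):
        # Deliver the change to every subscriber of this graph.
        for callback in self.subscribers.get(graph_uri, []):
            callback(added, removed)


class Replica:
    """Client-side copy of one named graph, patched by notifications."""

    def __init__(self):
        self.triples = set()

    def apply(self, added, removed):
        self.triples -= set(removed)
        self.triples |= set(added)
```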
Access Controls • Boca users can have the following system-wide permissions: • canInsertNamedGraphs -- a user must have this permission in order to create a new named graph (i.e. insert statements into a graph that does not yet exist in the system) • Boca users can have the following per-named-graph permissions (these apply also to the system graph): • canRead -- a user with this permission may view the triples in the named graph and in its metadata graph • canAdd -- a user with this permission may insert new triples into the named graph • canRemove -- a user with this permission may remove triples from the named graph • canChangeNamedGraphACL -- a user with this permission may change the ACL triples in the metadata graph • canRemoveNamedGraph -- a user with this permission may entirely remove the named graph from the system
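The permission rules above can be sketched as a simple check. The permission names come from the slide; the dictionaries, user names, and the `may_add_triple` function are assumptions made for illustration (in Boca the ACLs live as triples in each graph's metadata graph).

```python
# Permission names are from the slide; the data layout and users are assumptions.
SYSTEM_PERMS = {"alice": {"canInsertNamedGraphs"}}
GRAPH_ACL = {
    "http://example.org/graphs/g1": {
        "alice": {"canRead", "canAdd", "canRemove", "canChangeNamedGraphACL"},
        "bob": {"canRead"},
    }
}


def may_add_triple(user, graph_uri):
    """Adding to an existing graph needs canAdd; creating one needs canInsertNamedGraphs."""
    if graph_uri not in GRAPH_ACL:
        return "canInsertNamedGraphs" in SYSTEM_PERMS.get(user, set())
    return "canAdd" in GRAPH_ACL[graph_uri].get(user, set())
```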
Versioning • SVN-like approach to versioning • When a triple is added to or removed from a named graph, a new revision of that named graph is created. • Simple API for reading old revisions • Provides a straightforward mechanism for concurrent distributed computing. • When a client submits an update to a named graph, it may specify the version number that it currently has. The update will fail if the graph has been more recently modified.
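The versioning precondition amounts to optimistic concurrency: an update that names a stale revision is rejected. A minimal sketch, assuming each change produces one new revision and revisions are numbered from zero (the class and method names are hypothetical, not the Boca API):

```python
class StaleRevisionError(Exception):
    """Raised when the graph changed after the revision the client last saw."""


class VersionedGraph:
    def __init__(self):
        self.revisions = [frozenset()]  # revisions[n] is the content at revision n

    @property
    def revision(self):
        return len(self.revisions) - 1

    def update(self, added=(), removed=(), expected_revision=None):
        """Apply a change as a new revision, optionally guarded by a precondition."""
        if expected_revision is not None and expected_revision != self.revision:
            raise StaleRevisionError(f"graph is at revision {self.revision}")
        self.revisions.append((self.revisions[-1] - set(removed)) | set(added))
        return self.revision

    def read(self, revision=None):
        """Read the latest content, or an old revision by number."""
        return self.revisions[self.revision if revision is None else revision]
```

A client that read revision 1 and submits `update(..., expected_revision=1)` succeeds only if nobody else has written in the meantime; otherwise it must re-read and retry.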
Querying Boca • Users may query Boca in a variety of ways. • Query the complete database • Query a subset of named graphs • Query a particular named graph • Query the local store
SPARQL: Querying any data as RDF • SPARQL is a SQL-like language for querying distributed RDF graphs • RDF can be created on-the-fly from any data source • SPARQL is designed to handle: • Distributed data. Multiple distributed data sources can be queried at once because SPARQL addresses graphs by URI. • Ragged data. The SPARQL OPTIONAL keyword lets users explore heterogeneous data in a single query. • Unpredictable data. The ability to query for predicates and information about predicates makes SPARQL ideal for exploring new and unexpected data. • Open-world assumption • Example: • Show me all the AP stories where IBM is mentioned along with another Fortune 500 company. If present, also include the names of any analysts quoted in the article.
SPARQL: Example • Query: SELECT ?name ?phone WHERE { ?person <email> "djg@us.ibm.com" . ?person <phone> ?phone . ?person <name> ?name . } • Figure: the node http://…/DavidJGrossman with three outgoing edges: name → “David J Grossman”, phone → “693-0120”, email → “djg@us.ibm.com”.
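To show how the three triple patterns in this query bind variables, here is a small matcher for basic graph patterns over tuples. It is a teaching sketch, not a SPARQL engine: the `solve` function is hypothetical, terms starting with `?` are treated as variables, and the subject is abbreviated to `DavidJGrossman` because the slide elides its full URI. The second person in the data is invented to show that non-matching nodes are filtered out.

```python
def solve(graph, patterns, binding=None):
    """Yield variable bindings that satisfy every triple pattern in order."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for triple in graph:
        candidate = dict(binding)
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # Bind the variable, or reject if it is already bound differently.
                if candidate.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield from solve(graph, rest, candidate)


graph = [
    ("DavidJGrossman", "email", "djg@us.ibm.com"),
    ("DavidJGrossman", "phone", "693-0120"),
    ("DavidJGrossman", "name", "David J Grossman"),
    ("SomeoneElse", "email", "someone@example.org"),  # invented, filtered out
]
query = [
    ("?person", "email", "djg@us.ibm.com"),
    ("?person", "phone", "?phone"),
    ("?person", "name", "?name"),
]
rows = [(b["?name"], b["?phone"]) for b in solve(graph, query)]
```

Here `rows` holds the projected `?name` and `?phone` values, mirroring the SELECT clause.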
Abandoned features – Collections, Statement ACLs & Reification • Collections – a statement can exist in multiple collections • A more difficult programming model, what happens when I delete in the context of one collection? • Expensive to maintain • Not a widely accepted programming model (as named graphs are) • Statement-level ACLs • Too expensive • Difficult to program • Not particularly useful, other than the odd, very important statement • In that case, such a statement can live in its own named graph • Reification • Queries were very difficult to formulate • Most RDF applications do not deal with reification • Reification semantics often confused with true quoting • Reification is an arbitrary layer of indirection that can be solved with ontologies
Future Features • Arbitrary query-based replication/notification • Distributed servers
Building (Semantic (Web) Applications) atop Boca • Visualization of Semantic Data • Generation of Semantic Data via forms and drag ‘n’ drop. • SPARQL query interfaces • Semantic Annotation • Merging data from multiple sources
Semantic Web Application Challenges • The Boca API is relatively simple, but probably not simple enough for the breadth of Web developers we want to reach • Lack of good RDF tooling on the browser • Overwhelming choice of transport protocols for AJAX requests • Semantic content management • Binding of RDF data to DHTML widgets
Queso – A Semantic Web Content Management System • Boca Atom Publishing Protocol endpoint • Data enters and leaves the system through the APP REST API • Post, Put, Get, Delete, Undelete, Purge • Atom entries stored in Boca, binary content in the file system • Revision histories of entries and binary data provided via feeds • Elaborate caching mechanisms • Optimistic concurrency through Boca preconditions • RDF-DHTML widget data binding system • Collections of Dojo widgets, grouped into lenses • A lens renders a named graph whose URI is of a certain rdf:type, or the results of a standing SPARQL query • A JavaScript/HTTP servlet-based infrastructure manages replication of data between browser models/widgets and the server • No RDF manipulation in the browser!
Conclusion, future work, questions? • Boca will continue to be supported by IBM as an open source project for the foreseeable future • The number of adopters of Boca continues to grow, both within IBM and without