Knowledge Streams: Stream Processing of Semantic Web Content Mike Dean Principal Engineer Raytheon BBN Technologies mdean@bbn.com
Assumptions • Technology – Intermediate • Familiarity with RDF and OWL • Interest in • Stream processing • Scalability
Presenter Background • Principal Engineer at Raytheon BBN Technologies (1984-present) • Principal Investigator for DARPA Agent Markup Language (DAML) Integration and Transition (2000-2005) • Chaired the Joint US/EU Committee that developed DAML+OIL and SWRL • Developer and/or Principal Investigator for many Semantic Web tools, datasets, and applications (2000-present) • Member of the W3C RDF Core, Web Ontology, and Rule Interchange Format Working Groups • Co-editor of the W3C OWL Reference • Local co-chair for ISWC2009 • Other SemTech presentations • Semantic Query: Solving the Needs of a Net-Centric Data Sharing Environment (2007, w/ Matt Fisher) • Semantic Queries and Mediation in a RESTful Architecture (2008, w/ John Gilman and Matt Fisher) • Use of SWRL for Ontology Translation (2008) • Semantic Web @ BBN: Application to the Digital Whitewater Challenge (2009, w/ John Hebeler) • How is the Semantic Web Being Used? An Analysis of the Billion Triples Challenge Corpus (2009) • Finding a Good Ontology: The Open Ontology Repository Initiative (2010, w/ Peter Yim and Todd Schneider)
Outline • Motivation • Vision • Building Blocks • Demonstration
Motivations • Timeliness • Performance
Timeliness • Streaming minimizes latency • Processing elements see events as they occur • Resources are expended only when an event occurs • This is in contrast to polling • Latency averages half the polling interval • Resources are expended on every poll • Popular web syndication mechanisms such as RSS and Atom involve polling
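The "half the polling interval" figure can be checked with a small calculation. The sketch below is illustrative only: the 60-second interval and the evenly spaced event times are assumed values, not measurements from the talk. It computes how long each event waits for the next poll.

```python
import math

def mean_polling_latency(event_times, poll_interval):
    """Average delay between each event and the next poll that would observe it."""
    delays = []
    for t in event_times:
        # Polls happen at 0, poll_interval, 2 * poll_interval, ...
        next_poll = math.ceil(t / poll_interval) * poll_interval
        delays.append(next_poll - t)
    return sum(delays) / len(delays)

# Hypothetical workload: one event every 0.5 s for an hour, polled every 60 s.
events = [i * 0.5 for i in range(7200)]
latency = mean_polling_latency(events, poll_interval=60.0)
print(f"mean latency: {latency:.2f} s")
```

With events spread evenly across the interval, the mean comes out close to 30 seconds, half of the 60-second polling interval. A streaming consumer sees each event as it occurs, so this waiting term disappears.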
Performance • Many Semantic Web tools provide streaming parsers rather than, or in addition to, model access • Analogous to XML SAX vs. DOM • For suitable applications, this can be 10x faster than loading all statements into memory or a KB
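A minimal sketch of the streaming style, analogous to SAX: statements are handled one at a time and never collected into a model. The regular expression is a deliberately simplified stand-in for a real N-Triples parser (it assumes IRIs, blank nodes, and simple objects only), and the function names and sample data are hypothetical.

```python
import io
import re
from collections import Counter

# Simplified N-Triples pattern; a real parser handles many more cases.
TRIPLE = re.compile(r'^(<[^>]*>|_:\S+)\s+(<[^>]*>)\s+(.+?)\s*\.\s*$')

def stream_statements(lines):
    """Yield (subject, predicate, object) one statement at a time."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        match = TRIPLE.match(line)
        if match:
            yield match.groups()

def count_predicates(lines):
    """Aggregate over the stream without holding the statements in memory."""
    counts = Counter()
    for _subject, predicate, _object in stream_statements(lines):
        counts[predicate] += 1
    return counts

SAMPLE = """\
<http://example.org/a> <http://xmlns.com/foaf/0.1/name> "Alice" .
<http://example.org/a> <http://xmlns.com/foaf/0.1/knows> <http://example.org/b> .
<http://example.org/b> <http://xmlns.com/foaf/0.1/name> "Bob" .
"""

predicate_counts = count_predicates(io.StringIO(SAMPLE))
```

Memory use depends on the number of distinct predicates rather than the number of statements, which is what makes this style suitable for corpus-scale analysis.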
2 Streaming Stories • dumpont of OpenCyc (circa 2003) • HTML-based ontology visualization tool periodically bogged down daml.org server • Reimplementation using event-based Jena ARP parser yielded 10x performance and scalability improvements • Billion Triples Challenge 2009 • Streaming analysis of the 2009 corpus was performed at an overall rate of 103K statements/sec on a Mac laptop with a portable external disk • Compare to loading 10-20K statements/second on a server
Stream Processing Examples • Unix pipes • Dataflow architectures • StreamBase • IBM System S/InfoSphere Streams
Vision: Knowledge Streams • Processing elements • Consume and produce subgraphs • Multiple functions may be combined • Persistent pipelines • Streams of statements comprising object subgraphs • URI naming allows drill-down • Provenance, timestamps
[Diagram: Data Sources (Semantic Web, Sensor Network, Gazetteer, Imagery Database, Archive, Sensor, IM); Distribution and Processing Elements (aggregation, context, filter, augmentation, inference, distribution, correlation, persistent queries, translation, alerts, CEP, NLP, RSS); Users (Community of Interest 1, Community of Interest 2, User 1, User 2, User 3)]
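One way to picture processing elements that consume and produce subgraphs is as composable generator functions. This is a sketch under stated assumptions, not the talk's implementation: the `Triple` and `Subgraph` types, the element names, and the `ex:` terms are all hypothetical.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Triple = Tuple[str, str, str]
Subgraph = List[Triple]
Element = Callable[[Iterable[Subgraph]], Iterator[Subgraph]]

def filter_by_predicate(predicate: str) -> Element:
    """Filter element: pass only subgraphs that use the given predicate."""
    def run(stream: Iterable[Subgraph]) -> Iterator[Subgraph]:
        for graph in stream:
            if any(p == predicate for _s, p, _o in graph):
                yield graph
    return run

def add_provenance(source: str) -> Element:
    """Augmentation element: attach a provenance statement to each subgraph."""
    def run(stream: Iterable[Subgraph]) -> Iterator[Subgraph]:
        for graph in stream:
            if not graph:
                yield graph
                continue
            subject = graph[0][0]
            yield graph + [(subject, "ex:source", source)]
    return run

def pipeline(stream: Iterable[Subgraph], *elements: Element) -> Iterable[Subgraph]:
    """Combine multiple functions by chaining elements in order."""
    for element in elements:
        stream = element(stream)
    return stream

incoming = [
    [("ex:img1", "ex:coversRegion", "ex:boston")],
    [("ex:post7", "ex:mentions", "ex:alice")],
]
results = list(pipeline(incoming,
                        filter_by_predicate("ex:coversRegion"),
                        add_provenance("ex:sensorFeed")))
```

Because each element is lazy, a subgraph flows through the whole chain as soon as it arrives, which fits the persistent-pipeline model on the slide.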
Goals • Web-scale • Decentralized among multiple sites • Heterogeneous implementations • Long-lived, persistent connections • User accountability • Introspection over the processing network for control and optimization • E.g. aggregating subscriptions • Balance with security, privacy, and autonomy concerns
Building Blocks • RDF Content • Existing stream processing frameworks • Workflow systems • Publish/subscribe message oriented middleware
RDF Payloads • Malleable data • Standards-based graph structure • Can easily add, remove, and transform statements • Self-describing • Unique naming via URIs • References to vocabularies and ontologies • Potential for inference
Workflow Systems • Graphical environments for developing processing pipelines • Yahoo Pipes, DERI Pipes, SPARQLMotion • Nice user interfaces for development and execution http://pipes.deri.org
Semantic Complex Event Processing • Complex Event Processing • One of the leading edges of rules technology • Formal specification of higher-level events in terms of lower-level events • E.g. alert if the moving average increases 15% within a 10 minute window • Engine can be compiled/optimized for a specific rule set • High-volume deployments in finance and other industries • Most implementations focus on self-contained tuples • Semantic Complex Event Processing • Enrich CEP using Semantic Web technology • Emerging topic at recent conferences • Early implementations • Wrappers around open source CEP engines • Native implementation • Provides a powerful set of operators and engines for Knowledge Streams
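The moving-average example can be sketched in a few lines. The slide does not define the rule precisely, so this assumes one reading: the moving average is taken over the last few readings, and an alert fires when it is at least 15% above the lowest moving average seen in the preceding 10 minutes. The class name and parameters are hypothetical, and a production CEP engine would compile such a rule rather than hand-code it.

```python
from collections import deque

class MovingAverageRise:
    """Alert when the moving average rises by `threshold` within a time window."""

    def __init__(self, samples=3, window_seconds=600.0, threshold=0.15):
        self.values = deque(maxlen=samples)   # readings used for the average
        self.averages = deque()               # (timestamp, moving average) history
        self.window_seconds = window_seconds
        self.threshold = threshold

    def on_event(self, timestamp, value):
        self.values.append(value)
        average = sum(self.values) / len(self.values)
        # Discard history older than the window.
        while self.averages and timestamp - self.averages[0][0] > self.window_seconds:
            self.averages.popleft()
        # Compare against the lowest average still inside the window.
        baseline = min((a for _t, a in self.averages), default=average)
        self.averages.append((timestamp, average))
        return baseline > 0 and average >= baseline * (1 + self.threshold)

rule = MovingAverageRise()
readings = [(0, 100), (60, 100), (120, 100), (180, 130), (240, 160)]
alerts = [rule.on_event(t, v) for t, v in readings]
```

With these illustrative readings only the last event raises an alert: the three-sample average reaches 130 against a baseline of 100 from earlier in the window.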
Implementation Approach • Well-defined APIs for implementing operators • Operator execution containers • Could encapsulate existing engines • Start with manual processing network configuration, then automate
Use Cases • Dissemination of metadata for new satellite imagery • Social network changes • Alerting of friends’ new publications • …
Demo • Processing using DERI Pipes with new operators • Ingest of #SemTechBiz tweets using Twitter Streaming API • Conversion of JSON to RDF • Mapping to SIOC vocabulary using SWRL rules • Enrich by matching Twitter @handles with contacts • Persistent buffering using Java Message Service • Monitoring
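The JSON-to-RDF step can be illustrated with a direct mapping. Note the substitution: the demo performs the SIOC mapping with SWRL rules inside DERI Pipes, whereas this sketch hard-codes the mapping in Python. The choice of SIOC terms, the URI patterns, and the sample tweet are assumptions made for illustration.

```python
import json

SIOC = "http://rdfs.org/sioc/ns#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def literal(text):
    """Render a plain N-Triples literal with basic escaping."""
    escaped = (text.replace("\\", "\\\\").replace('"', '\\"')
                   .replace("\n", "\\n").replace("\r", "\\r"))
    return f'"{escaped}"'

def tweet_to_ntriples(raw_json):
    """Map one tweet (JSON text) to N-Triples lines using SIOC terms."""
    tweet = json.loads(raw_json)
    handle = tweet["user"]["screen_name"]
    post = f"<https://twitter.com/{handle}/status/{tweet['id_str']}>"
    account = f"<https://twitter.com/{handle}>"
    return [
        f"{post} <{RDF_TYPE}> <{SIOC}Post> .",
        f"{post} <{SIOC}content> {literal(tweet['text'])} .",
        f"{post} <{SIOC}has_creator> {account} .",
        f"{account} <{RDF_TYPE}> <{SIOC}UserAccount> .",
    ]

# Hypothetical tweet, reduced to the fields the mapping reads.
SAMPLE = json.dumps({
    "id_str": "1",
    "text": "Streaming RDF at #SemTechBiz",
    "user": {"screen_name": "example"},
})
triples = tweet_to_ntriples(SAMPLE)
```

Each tweet becomes a small subgraph rooted at the post URI, which later steps such as contact matching can extend with further statements.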