Scalable Integration and Processing of Linked Data

Andreas Harth, Aidan Hogan, Spyros Kotoulas, Jacopo Urbani

Outline
• Session 1: Introduction to Linked Data
  • Foundations and Architectures
  • Crawling and Indexing
  • Querying
• Session 2: Integrating Web Data with Reasoning
  • Introduction to RDFS/OWL on the Web
  • Introduction and Motivation for Reasoning
• Session 3: Distributed Reasoning: Because Size Matters
  • Problems and Challenges
  • MapReduce and WebPIE
• Session 4: Putting Things Together (Demo)
  • The LarKC Platform
  • Implementing a LarKC Workflow
http://larkc.eu

Session Outline

Goals of LarKC
LarKC = a platform for large-scale reasoning
Quote from EU Project Officer: “LarKC's value is as an experimental platform. LarKC is an environment where people can go to replicate (or extend) their results in an environment where all the infrastructural heavy lifting has already been taken care of”

Goals of LarKC
LarKC = a platform for large-scale reasoning
Quote from EU Reviewer: “Significant progress is sometimes made not by making something possible that was impossible before, but by substantially lowering the costs of something that was only possible before at high cost”

What do we mean by: LarKC = a platform for large-scale reasoning
• reusable components
• reconfigurable workflows
• provide infrastructure needed by all users:
  • storage & retrieval
  • registration of plugins
  • communication (plugin2datalayer, plugin2plugins)
  • synchronisation (anytime behaviour)
  • remote execution (abstracts from local/remote invocation)
  • remote data-access (abstracts from local/remote storage)
• (will) provide:
  • instrumentation & measuring
  • caching
• integration of very heterogeneous components
  • heterogeneous data: unstructured text, (semi)structured data
  • heterogeneous code: Java, scripts, remote services ("wrap & integrate")

What do we mean by: LarKC = a platform for large-scale reasoning
not only from raw large numbers
• from performant data-layer
• from parallel deployment of plugins
• from load-balancing strategies
• …
but also from interaction of multiple components
• e.g. avoid reasoning through selection: SELECT + REASON
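The SELECT + REASON idea can be sketched as follows. This is a minimal illustration, not LarKC code: the toy triples, the function names `select` and `reason`, and the reachability-based selection rule are all assumptions made for the example. The reasoner here only computes the transitivity of rdfs:subClassOf, and it runs on the selected triples rather than on the whole data set.

```python
from itertools import product

SUBCLASS = "rdfs:subClassOf"

# Hypothetical toy data set; none of these triples come from the slides.
TRIPLES = {
    ("ex:Cat", SUBCLASS, "ex:Mammal"),
    ("ex:Mammal", SUBCLASS, "ex:Animal"),
    ("ex:Car", SUBCLASS, "ex:Vehicle"),
    ("ex:Vehicle", SUBCLASS, "ex:Artifact"),
}


def select(triples, seed_terms):
    """Keep only the triples reachable from the seed terms (assumed selection rule)."""
    selected, frontier = set(), set(seed_terms)
    while frontier:
        term = frontier.pop()
        for triple in triples:
            if triple[0] == term and triple not in selected:
                selected.add(triple)
                frontier.add(triple[2])
    return selected


def reason(triples):
    """Naive fixpoint computation of rdfs:subClassOf transitivity."""
    closure = set(triples)
    while True:
        new = {
            (a, SUBCLASS, d)
            for (a, p1, b), (c, p2, d) in product(closure, closure)
            if p1 == p2 == SUBCLASS and b == c
        } - closure
        if not new:
            return closure
        closure |= new


# SELECT first, then REASON only over the selected part.
selected = select(TRIPLES, {"ex:Cat"})
inferred = reason(selected) - selected
print(sorted(inferred))
```

The pairwise join in `reason` grows quadratically with its input, which is why shrinking the input through selection pays off even in this toy setting.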

Overall approach of LarKC

How to deploy LarKC

Why would people (like you) want to use LarKC

Simplified Framework
But what about flexibility, modularity, scalability, distribution?

The LarKC Domain

The LarKC Platform Architecture
[Figure: architecture diagram. LarKC Platform: Plug-in Registry, LarKC RTE, Management Interface, Plug-in Managers, Data Layer. Storage Resources: RDF Store, RDF Doc. Computing Resources: User Desktop Machine, High-Performance Computer, Cloud Resource.]

The LarKC Platform - Components
• LarKC RTE: initialisation and invocation of workflows
• Plug-in Registry: management of plug-ins
• Mgmt Interface: workflow deployment
• Plug-in Manager: plug-in execution
• Data Layer: data management

The LarKC Platform - Features
Plug-in Registry, LarKC RTE, Mgmt Interface:
• Plug-in / workflow descriptions and plug-in parameters are in RDF
• Separation of workflow specification and execution
• Integration of various endpoints (e.g. SPARQL endpoint) and applications
• Workflow branching, splits, merges

Workflow Description

    _:i  larkc:pluginTypeOf <urn:eu.larkc.plugin.identify.MyIdentifier> ;
         larkc:pluginConnectsTo _:t1 , _:t2 .
    _:d  larkc:pluginTypeOf <urn:eu.larkc.plugin.decider.MyDecider> .
    _:t1 larkc:pluginTypeOf <urn:eu.larkc.plugin.transform.MyTransformer> ;
         larkc:pluginConnectsTo _:d .
    _:t2 larkc:pluginTypeOf <urn:eu.larkc.plugin.transform.MyTransformer> ;
         larkc:pluginConnectsTo _:d .
    _:e  larkc:endpointType <urn:eu.larkc.endpoint.sparql> ;
         larkc:endpointConnectsTo _:d .

[Figure: workflow diagram; box labels: SPARQL, Transformer, Filter, Decider, Identifier, Filter, Transformer]
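To make the shape of this workflow concrete, the sketch below copies the larkc:pluginConnectsTo edges from the description into a plain dictionary and derives one valid execution order. How the LarKC RTE actually schedules plug-ins is not stated on the slides; the use of a topological sort, the `execution_order` helper and the dictionary encoding are assumptions for illustration. It relies on `graphlib` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Edges copied from the larkc:pluginConnectsTo statements in the workflow
# description (blank-node labels written without the "_:" prefix).
CONNECTS_TO = {
    "i": ["t1", "t2"],  # identifier feeds both transformers (a split)
    "t1": ["d"],        # both transformers feed the decider (a merge)
    "t2": ["d"],
    "d": [],
}


def execution_order(connects_to):
    """Return one order in which every plug-in runs after all its inputs.

    Hypothetical helper: it illustrates the dependency structure only and
    is not taken from the LarKC code base.
    """
    sorter = TopologicalSorter()
    for source, targets in connects_to.items():
        sorter.add(source)
        for target in targets:
            # "target" depends on "source", so "source" is its predecessor.
            sorter.add(target, source)
    return list(sorter.static_order())


order = execution_order(CONNECTS_TO)
print(order)
```

The identifier always comes first and the decider last; the two transformers have no dependency on each other, so either may precede the other (or both may run concurrently).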

The LarKC Platform - Features
Plug-in Manager, Data Layer (API):
• Plug-in (remote) execution, parallelisation support, anytime behaviour
• Data caching, instrumentation and event processing
• Data storage, data streaming, parallel request handling
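"Anytime behaviour" is used here in the usual sense of an anytime algorithm: a usable result is available whenever the computation is interrupted, and the result improves the longer it runs. The slides do not show how LarKC plug-ins expose this, so the generator below is only a sketch of the general pattern; `anytime_identify`, the sample strings and the length-based score are hypothetical.

```python
from itertools import islice


def anytime_identify(candidates, score):
    """Yield the best candidate seen so far after inspecting each one.

    A consumer may stop iterating at any point and keep the last value.
    """
    best = None
    for candidate in candidates:
        if best is None or score(candidate) > score(best):
            best = candidate
        yield best


# Hypothetical inputs; the score is simply the string length.
docs = ["rdf", "linked data", "owl", "scalable linked data reasoning"]

# Interrupted after two steps: a valid but not yet optimal answer.
partial = list(islice(anytime_identify(docs, len), 2))[-1]

# Allowed to run to completion: the best answer overall.
final = list(anytime_identify(docs, len))[-1]

print(partial)
print(final)
```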

Distributed Execution Support

Distribution: JavaGAT
• Toolkit providing adapters to access remote resources
• Enables the use of HPC clusters from / within LarKC workflows
• Causes additional overhead depending on network / resource settings

Distribution: JEE Technology
• Wrapping a plug-in into a Java servlet and deploying it to a servlet container (e.g., Tomcat)
• Overhead relatively small in comparison to JavaGAT

Parallelisation Support
• Running multiple instances of the same plugin simultaneously
• Implementation of parallelism in the concurrent regions of the plugin’s code
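The first option, running several instances of the same plugin at once, can be sketched with a thread pool. This is an illustration under stated assumptions rather than the LarKC mechanism: `UppercaseTransformer`, its `invoke` method and the `run_parallel` helper are hypothetical names, and the input is assumed to be already split into independent chunks.

```python
from concurrent.futures import ThreadPoolExecutor


class UppercaseTransformer:
    """Hypothetical stand-in for a plug-in; not part of the LarKC API."""

    def invoke(self, chunk):
        return [item.upper() for item in chunk]


def run_parallel(plugin_factory, chunks, max_workers=4):
    """Process each chunk with its own plug-in instance, concurrently."""

    def work(chunk):
        # A fresh instance per chunk, so instances share no state.
        return plugin_factory().invoke(chunk)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the order of the input chunks.
        return list(pool.map(work, chunks))


chunks = [["a", "b"], ["c"], ["d", "e", "f"]]
results = run_parallel(UppercaseTransformer, chunks)
print(results)
```

Creating one instance per chunk keeps the example free of shared mutable state; a plug-in that parallelises its own internal regions (the second option) would instead manage a pool inside `invoke`.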

How it works…


Conclusion

See also
• Linked data gathering
• Slides, pointers etc. at http://sild.cs.vu.nl/
