Linked Open GeoData Management in the Cloud K. Kritikos, Y. Roussakis ICS-FORTH D. Kotzinos ICS-FORTH & TEI of Serres
Cloud Computing • Better (faster, reliable, etc.) infrastructure – IaaS • Development infrastructure – PaaS • Software infrastructure – SaaS
Cloud Computing • Data as a Service (DaaS): publication, querying & updating of data
Linked (Open) Data as a Service • Publishing Linked Data • URI construction • Conceptual Model • Storage as RDF files or SPARQL endpoints • Querying Linked Data • SPARQL • GeoSPARQL • Updating Linked Data • SPARUL • Synchronization with original sources
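A minimal sketch of the querying step above: fetching results from a Linked Data SPARQL endpoint over the standard SPARQL Protocol. The endpoint URL and the query are placeholders, not the project's actual deployment.

```python
# Minimal sketch: querying a Linked Data SPARQL endpoint over HTTP.
# The endpoint URL and the query are placeholders, not the project's deployment.
import requests

ENDPOINT = "http://example.org/sparql"  # hypothetical SPARQL endpoint

query = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?feature ?label WHERE {
  ?feature rdfs:label ?label .
} LIMIT 10
"""

# The SPARQL Protocol allows queries via HTTP GET with a 'query' parameter;
# results are requested in the SPARQL JSON results format.
resp = requests.get(ENDPOINT,
                    params={"query": query},
                    headers={"Accept": "application/sparql-results+json"})
resp.raise_for_status()

for binding in resp.json()["results"]["bindings"]:
    print(binding["feature"]["value"], binding["label"]["value"])
```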
Problem Introduction (I) • INGeoCloudS FP7 Pilot B Project (www.ingeoclouds.eu) • Geophysical data from different sources and in different formats (Excel, XML, relational, none …) • Borehole and Groundwater Analysis • Boreholes located in Mygdonia/Thriasio in Greece and country-wide in Denmark and France, together with their features (static data over time) • Chemical analyses of groundwater sampled from these boreholes (data updated over time) • Earthquake events and their features • Landslides
Data granularity • Data refer to different levels of granularity, e.g. susceptibility maps refer to a country-wide area while earthquakes or boreholes are point-level data • Data might need to be aggregated, but such aggregation is based on the spatial dimension, i.e. points contained within a polygon (see the query sketch below) • Some problems with aggregation do exist, since phenomena outside the area of concern may affect it, so spatial aggregation might not be enough
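To make the spatial-aggregation point concrete, a hedged GeoSPARQL-style sketch that counts point features (e.g. boreholes) falling inside a polygon of concern. Only the geo:/geof: vocabulary is standard GeoSPARQL; the polygon coordinates and the shape of the data are illustrative.

```python
# Illustrative only: a GeoSPARQL-style aggregation counting point features
# inside a given polygon. Everything except the standard geo:/geof:
# vocabulary is a placeholder.
aggregation_query = """
PREFIX geo:  <http://www.opengis.net/ont/geosparql#>
PREFIX geof: <http://www.opengis.net/def/function/geosparql/>

SELECT (COUNT(?point) AS ?pointsInArea) WHERE {
  ?point geo:hasGeometry/geo:asWKT ?wkt .
  # keep only points spatially contained in the area of concern
  FILTER (geof:sfWithin(?wkt,
    "POLYGON((22.9 40.6, 23.4 40.6, 23.4 40.8, 22.9 40.8, 22.9 40.6))"^^geo:wktLiteral))
}
"""
```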
• Landslides: • Which area is affected, and how much? • How does this change over time? • Is the earthquake effect cumulative or does it fade over time? • Earthquakes: • How far back in time should we go? • What information should be kept / would be relevant? • How should we query the repository to get the relevant information?
Problem Introduction (II) • Data/Metadata Standards • The INSPIRE standard proposes a generic conceptual schema for scientific data + models for 34 spatial data themes • Dealing with geospatial data & maintaining schemas/ontologies becomes difficult • The challenge is to handle this semantic heterogeneity • Need to offer a seamless & transparent LOD-as-a-Service (LODaaS) way to manage LOD • Lack of tools for mapping, transforming & synchronizing geo-spatial LD • Generic LOD management independent of the way LOD are stored
Points of interest • GeoData get bigger and more important • Used in a variety of applications in different fields • Size & high demand impose considerable requirements on infrastructure (storage size & compute power) • Need to be reused and linked with other data sets • Go beyond the current Web paradigm of isolated data silos • Current work on geo-spatial open data management does not offer this • Cloud-based approaches: • do not provide geo-spatial support • Some do not fully support SPARQL or offer SPARQL end-points • Centralized approaches offer geo-spatial support but: • Do not enable automatic mapping between relational and RDF data • Have worse performance in general (with the exception of Strabon wrt geo-spatial query support)
Proposed Solution (I) • A specific set of LODaaS services for geo-spatial LOD publishing, integration & querying • The cloud offers scalability & elasticity of computation, 24/7 availability & multiple data storage and integration offerings • Our cloud-based service-oriented system: • Exhibits good LOD management performance • Exposes a LOD management service that abstracts away RDF store peculiarities & provides a generic way for LOD access and management
Proposed Solution (II) • A particular solution is adopted for mapping geo-spatial data in different formats to RDF data • The latter conform to extensible conceptual models that accurately capture thematic areas and are integrated via the Geo-Scientific Observation Model • This allows posing queries across providers and thematic fields • Our solution is part of the system developed in the context of the InGeoCloudS project, which exploits cloud capabilities & LD technology to integrate & store heterogeneous geo-spatial data sets of different thematic fields + host & execute applications that exploit these data sets
Architecture (I) • The system is scalable and elastic by exploiting cloud facilities • An extensive application pool can be built on top that exploits the offered services to perform various added-value and highly demanding tasks: • LO GeoData visualization, discovery & composition of data-sets, LO GeoData analytics • The system could be extended to host such applications & offer various (geo-spatial) LO GeoData processing services and pre-built applications
Architecture (II) • Distributor: equally distributes generic queries & collects back the results; non-generic queries are sent to instances holding the appropriate data; data distribution is achieved by assigning new data to the least loaded (wrt storage space) component of the scaling layer; exploits the CPU monitoring & elasticity facilities of Amazon (routing logic sketched below) • Scaling Layer: comprises one or more LOD management components; data are replicated across these components to enhance reliability & enable layer-based load balancing • LOD Management Component: comprises a LOD Management Service (LMS) instance & a Virtuoso server for storage • LMS: provides methods for data providers to manage LOD & for other users to query & export the stored LOD • Virtuoso: the underlying RDF triple store, also allowing the mapping & synchronization between relational and RDF data
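The Distributor's routing behavior could be summarized roughly as follows. This is an illustrative sketch under stated assumptions (each component reports its storage usage and holds a known set of graphs); the names and metrics are not the actual implementation.

```python
# Sketch of the Distributor's routing logic, assuming each LOD management
# component reports storage usage and the set of graphs it holds.
from dataclasses import dataclass, field

@dataclass
class Instance:
    url: str
    storage_used: float            # fraction of disk used (assumed metric)
    graphs: set = field(default_factory=set)

class Distributor:
    def __init__(self, instances):
        self.instances = instances
        self._rr = 0               # round-robin cursor for generic queries

    def route_query(self, graph_uri=None):
        """Generic queries are spread evenly; graph-specific queries go to
        an instance that actually holds the requested data."""
        if graph_uri is None:
            inst = self.instances[self._rr % len(self.instances)]
            self._rr += 1
            return inst
        return next((i for i in self.instances if graph_uri in i.graphs), None)

    def place_new_data(self, graph_uri):
        """New data are assigned to the least loaded instance
        (wrt storage space), as described in the slide above."""
        target = min(self.instances, key=lambda i: i.storage_used)
        target.graphs.add(graph_uri)
        return target
```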
General Query Evaluation Behavior [Figure: response time vs. time passed, marking the point at which a 2nd instance becomes involved]
LOD Integration & Publishing (I) • Extension of the high-level CIDOC-CRM conceptual model • The new model is called Geo-Scientific Spatial Observation Model (GSOM) & is expressed in RDF/S • It enables capturing all information coming from different fields & countries + linking data across different providers • INSPIRE was not exploited as it did not cover all requirements: • Capturing of scientific events • Complicated and cumbersome for information integration • In some cases, does not cover all appropriate information required by the data providers in particular thematic fields • GSOM-to-INSPIRE mapping specification to enable exporting INSPIRE-compliant data
LOD Integration & Publishing (II) • Two alternatives for publishing LOD • First alternative: create and import RDF-based descriptions of data-sets via a particular LMS method • The data update process must then be controlled by performing SPARUL updates via a particular LMS method (example below) • It is the data provider's responsibility to keep the relational & RDF data synchronized • Perfect synchronization may also not be required, as it may incur costs -> the second alternative becomes preferable
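Under the first alternative, the kind of SPARUL update a data provider would submit through the LMS might look like the sketch below. The graph URI, the ex: prefix and the property names are hypothetical; they are not part of GSOM.

```python
# Illustrative only: a SPARUL update keeping published RDF in step with the
# source data. Graph URI, prefix and resource names are hypothetical.
sparul_update = """
PREFIX ex: <http://example.org/geodata/>

INSERT DATA {
  GRAPH <http://example.org/graphs/boreholes> {
    ex:borehole_42  ex:waterLevel  "12.7"^^<http://www.w3.org/2001/XMLSchema#decimal> ;
                    ex:measuredOn  "2013-05-01"^^<http://www.w3.org/2001/XMLSchema#date> .
  }
}
"""
```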
LOD Integration & Publishing (III) • Second alternative: the data provider publishes the relational data of his/her data sets + provides a mapping file in R2RML to enable the synchronization of relational to RDF data (by executing an LMS method) • The system takes care of this synchronization • Relational storage is used as it has been for many years + additional RDF storage for the data, with automatic one-way synchronization between the two • The provider should have a good knowledge of GSOM & RDF
LOD Integration & Publishing (IV) • R2RML: • W3C recommendation since 2012 • Can specify customized mappings between RDB & RDF data • An R2RML specification is just an RDF graph in Turtle • No specific implementation is imposed • Virtuoso supports R2RML by processing the R2RML specification & creating the respective RDB2RDF triggers (used for creating/updating RDF data from relational data) • Either an RDF view or a physical RDF graph can be created, with the second option leading to far better performance (mapping sketch below)
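For illustration, a tiny R2RML mapping (as a Turtle string) in the spirit of the borehole diagram on the next slide. The table and column names, the gsom: namespace and the class/property names are assumptions; the real mapping targets the actual GSOM vocabulary (e.g. S16.Borehole, P1F.is_identified_by).

```python
# A minimal R2RML sketch: one TriplesMap turning rows of a (hypothetical)
# BOREHOLE table into RDF resources. Names are illustrative only.
r2rml_mapping = """
@prefix rr:   <http://www.w3.org/ns/r2rml#> .
@prefix gsom: <http://example.org/gsom/> .   # hypothetical GSOM namespace

<#BoreholeMap>
    rr:logicalTable [ rr:tableName "BOREHOLE" ] ;
    rr:subjectMap [
        rr:template "http://orgURL/Borehole/{BOREHOLE_ID}" ;
        rr:class    gsom:Borehole
    ] ;
    rr:predicateObjectMap [
        rr:predicate gsom:isIdentifiedBy ;
        rr:objectMap [ rr:column "BOREHOLE_NAME" ]
    ] .
"""
```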
R2RML [Diagram: mapping of the borehole relational model to GSOM — relational columns such as Borehole_Name, Sample_ID and Waterlevel are mapped via R2RML to GSOM/CRM classes (E26.Physical_Feature, S15.Aquifer, S16.Borehole, S2.SampleTaking, S13.Sample, E41.Appellation, E42.Identifier, E54.Dimension) through properties such as O4.sampled_from, P121F.overlaps_with, O5.removed, P43F.has_dimension and P1F.is_identified_by; URI identification follows the pattern http://orgURL/SampleID/XYZ; publication relies on RDB-to-RDF synchronization]
LOD Management Service (I) • REST-based service with an API exposing all appropriate management functionality needed by geo-spatial LOD users • Abstracts away from the peculiarities of RDF triple stores • Enables simple & intuitive use of a specific set of LOD management methods • Programmatic or form-based access to methods • Production of query results in different forms, such as WKT, GML & KML • Importing/exporting capabilities in different formats (RDF/XML, NTriples, Turtle)
LOD Management Service (II) • The provided methods are: • meta_query(SPARQL string, timeout (opt.), row limit (opt.)): user-requested format (e.g., JSON) -> see the example call below • meta_update(SPARUL string, baseURI, timeout (opt.), row limit (opt.)) • meta_addMappings(R2RML string, graphURI) -> initiates the mapping procedure • meta_export(graphURI, subjURI, predURI, objURI, internal): user-requested format -> the last parameter indicates whether the result will be returned inline in the response • meta_import(url, graphUri, format, blocking): ImportStatus -> RDF data are imported by downloading them via the provided URL or inline in the user request + the method can be blocking or non-blocking • import_status(importID): ImportStatus -> in case of a non-blocking import request, the user can inquire about the status of his/her import by passing the importID field returned by the previous method as input to this method
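As a rough illustration of programmatic access, a call to the meta_query method over HTTP. The base URL, the URL path and the parameter names ('query', 'format') are assumptions about the REST encoding; the actual API may define them differently.

```python
# Sketch of a programmatic meta_query call over the LMS REST API.
# Base URL, path and parameter names are assumptions, not the documented API.
import requests

LMS_BASE = "http://example.org/lms"   # hypothetical service base URL

query = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"

resp = requests.get(f"{LMS_BASE}/meta_query",
                    params={"query": query, "format": "application/json"})
resp.raise_for_status()
print(resp.json())
```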
LOD Management Service (III) • Each method is accessible via a specific URL + produces meaningful exception messages (e.g., in case the user input is wrong) • User-friendly HTML documentation produced via Enunciate • The implementation exploits the Sesame RDF data management API, Virtuoso's JDBC driver & Jersey
Open Issues (I) • Model: • Extend it to capture other thematic fields • Data published in our system could fulfill all requirements to be 5-star LOD if the respective owners decide to do so • Data mapping: • The cloud-based Virtuoso version supports its native relational DB for RDB2RDF synchronization • Trade-off between LOD management completeness & cost • Mapping tools are needed to allow visual editing of R2RML without requiring data providers to have a good knowledge of RDF • Research issue: support bi-directional RDB2RDF mappings
Open Issues (II) • Geo-spatial query support: • Virtuoso does not support GeoSPARQL • Virtuoso has limited geo-spatial query support, only in commercial versions • 2D geometries + a limited set of topological relation operators • Additional support is needed in terms of geometry dimensionality + feature aggregation operators • Could extend Virtuoso via frameworks, such as uSeekM, which provide adequate geo-spatial support along with the capability of evaluating GeoSPARQL queries • Such solutions require processing all stored RDF data to create geo-spatial indices as well as deploying another DB -> they do not fit well with automatic geo-spatial LOD management • Could resolve the problem by: (a) performing re-indexing at infrequent time intervals, (b) creating specialized triggers which trigger re-indexing only when RDF data are updated
Open Issues (III) • Quality & provenance: • Original input data sets may not have the appropriate quality -> the resulting RDF data can have the same or a lower quality level • The proposed infrastructure must be extended with quality-resolving procedures & methods (e.g., data cleansing methods for correcting the exploited data) • Provenance information can ensure the correct updating of LD + assist in the LD reasoning process by deriving additional facts • Thus, provenance information should be exploited by our system, especially as such exploitation is not enabled by most LOD management systems
Conclusions • Proposed a scalable, geo-spatial LOD-as-a-Service management system deployed on the Amazon cloud • Distributes the query load + scales up/down when CPU utilization surpasses specific thresholds • Exposes a REST-based service with LOD management methods • Provides two different ways for publishing open geo-spatial data sets • Future work: advance the geo-spatial support level in two directions: • Realize the GSOM-to-INSPIRE mapping to enable producing INSPIRE-compliant data • Extend Virtuoso with geo-spatial indexing & query systems to enable the efficient processing of rich & expressive geo-spatial queries, expressed either in SPARQL or GeoSPARQL