Representing Contextual Aspects of Data. Andreas Harth Joint work with Juan Salas. PlanetData 1st EC Review, 7-8 December 2011, Luxembourg. Outline. Motivation Source Datasets NeoGeo Vocabulary Integration Algorithm Integrated Datasets and Services Community Activities Demo
Representing Contextual Aspects of Data Andreas Harth Joint work with Juan Salas PlanetData 1st EC Review, 7-8 December 2011, Luxembourg
Outline Motivation Source Datasets NeoGeo Vocabulary Integration Algorithm Integrated Datasets and Services Community Activities Demo Outlook Conclusion
Motivation Geodata is becoming increasingly relevant • Location-based services • Mobile applications • Ever-increasing amount of sensor data (phones, satellites) Data is published in many formats • GML, KML, WKT, RDF?… Applications require integrated access to geodata • Spatial querying • Spatial reasoning
GeoData Geospatial data is ubiquitous in information management, whether it is aimed at scientific, industrial or just everyday activities. For this reason, a shared representation of GeoData is of vital importance for the future of the Semantic Web. Example application fields include transport, demography, mobile applications, remote sensing, commerce (and many more…)
Requirements Integrated data format (syntax) and access (data transfer protocol) • Linked Data (RDF, HTTP) Mapping to a common vocabulary • Focus on representing geographic regions Mappings between instances Algorithms and systems for integrated querying Algorithms and systems for integrated reasoning
Integration Challenges Vocabularies – http://geovocab.org/doc/survey.html • Survey of several well-known Linked Data datasets (Ordnance Survey, GeoLinkedData.es, LinkedGeoData.org, GeoNames, DBpedia). • Identified properties and classes mapped to the NeoGeo vocabularies published at GeoVocab.org Instances • Finding equivalences between regions across multiple datasets at the geometry level.
Geodata Integration System Architecture (diagram: a query "?" and an answer "!" at the Integration layer; below it Wrapper 1; Mapping 1, Mapping 2, …, Mapping n; Source 1, Source 2, …, Source n)
Integration Vocabulary GeoVocab.org is an initiative to study methods and tools for the integration of geospatial data on the Semantic Web. Geometry Vocabulary – http://geovocab.org/geometry • Representation of georeferenced geometric shapes. Spatial Ontology – http://geovocab.org/spatial • Representation of and reasoning on topological relations based on the Region Connection Calculus. (diagram labels: spatial:Feature, ngeo:geometry, ngeo:Geometry, spatial:*)
Spatial Ontology – http://geovocab.org/spatial Uses the RCC vocabulary for the representation of topological relations between regions. Supports RCC5 and RCC8 relations. Inference is available for most RCC relations; however, some rules require "Negation as Failure", which is not supported in OWL.
Spatial Properties (RCC-8)
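To make the eight RCC-8 relations concrete, the following minimal sketch classifies two closed one-dimensional intervals. It is an illustration only, not the GeoVocab.org implementation: the function name `rcc8` and the reduction to intervals are assumptions, and the returned strings are the standard RCC-8 abbreviations rather than confirmed property names of the spatial ontology.

```python
def rcc8(a, b):
    """Return the RCC-8 relation between two closed intervals.

    a and b are (start, end) tuples with start < end (assumed, not checked).
    """
    a1, a2 = a
    b1, b2 = b
    if a2 < b1 or b2 < a1:
        return "DC"      # disconnected: no shared point
    if a2 == b1 or b2 == a1:
        return "EC"      # externally connected: only an endpoint is shared
    if a1 == b1 and a2 == b2:
        return "EQ"      # equal
    if b1 <= a1 and a2 <= b2:
        # a is a proper part of b; tangential if they share an endpoint
        return "TPP" if (a1 == b1 or a2 == b2) else "NTPP"
    if a1 <= b1 and b2 <= a2:
        # b is a proper part of a (inverse relations)
        return "TPPi" if (a1 == b1 or a2 == b2) else "NTPPi"
    return "PO"          # partially overlapping


print(rcc8((0, 4), (1, 3)))
```

For real polygons the same case analysis applies, but the tests on endpoints become tests on interiors and boundaries, which is what spatial libraries and databases provide.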
Geometry Ontology – http://geovocab.org/geometry Premises: • Open RDF format • Fully based on Linked Data principles Based on: • ISO 19109 - OGC General Feature Model • ISO 19137 - Core profile of the spatial schema
Geometry Ontology – http://geovocab.org/geometry Because the Geometry ontology is based on the General Feature Model, it distinguishes between the feature (the resource to which the geometry belongs) and the actual geometry. As a result: • The semantics of the feature are more important than the representation of the geometry. • Instances of the feature are related to the type of the feature. • A feature can be related to multiple geometries, not as a MultiLineString, MultiPolygon or MultiPoint, but as multiple distinct geometries. This makes it possible to model different geometric properties for a single feature (e.g. different scales). Because the ontology is also based on ISO 19137, the geometries that can be represented are Point, LineString, Polygon, MultiPoint, MultiLineString and MultiPolygon, which should suffice for most use cases without adding extra complexity.
Geometry Ontology – http://geovocab.org/geometry Unlike GML/WKT representations embedded into RDF, the Geometry Ontology is fully based on RDF. Advantages: • Geometries can be aggregated. For example, a MultiPolygon can be composed of several Polygon resources, each with its own URI and metadata. • Metadata can be added to individual parts of a geometry. For example, disputed borders can be labelled as such, or a polygon can be composed of GPS-obtained measurements, each with its own version and measurement date. Disadvantages: • The geometry must be reassembled into WKT or GML in order to use current libraries for querying or spatial indexing.
Different Approaches: List of W3C Geo Coordinates The coordinates of a geometric shape are encoded as a list of W3C Geo Point resources. This approach is based on the current implementations of some RDF spatial datasets such as GeoLinkedData.es and LinkedGeoData.org. Advantages: • Allows adding metadata to nodes. • Allows linking geometries at node level. Disadvantages: • Restricted to WGS 84. • Generates a large number of triples, which must be joined when using current libraries for querying.
Example: List of W3C Geo Coordinates
@prefix geo: <http://www.w3.org/2003/01/geo/wgs84_pos#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix nuts: <http://nuts.geovocab.org/id/> .
@prefix ngeo: <http://geovocab.org/geometry#> .
nuts:DE123_geometry rdf:type ngeo:Polygon .
nuts:DE123_geometry ngeo:exterior _:d1e59878 .
_:d1e59878 rdf:type ngeo:LinearRing .
_:d1e59878 ngeo:posList (
    [ geo:long "8.33996995"; geo:lat "49.08015" ]
    [ geo:long "8.41577995"; geo:lat "49.2510995" ]
    [ geo:long "8.46698545"; geo:lat "49.2829755" ]
    [ geo:long "8.48726795"; geo:lat "49.2900265" ]
    [ geo:long "8.81823295"; geo:lat "49.194497" ]
    [ geo:long "8.87779445"; geo:lat "49.0584785" ]
    [ geo:long "8.57685695"; geo:lat "48.9896935" ]
    [ geo:long "8.49357245"; geo:lat "48.820182" ]
    [ geo:long "8.41662495"; geo:lat "48.835368" ]
    [ geo:long "8.30566745"; geo:lat "48.862568" ]
    [ geo:long "8.35457445"; geo:lat "48.934889" ]
    [ geo:long "8.26128395"; geo:lat "48.980917" ]
    [ geo:long "8.27714095"; geo:lat "48.99016" ]
    [ geo:long "8.53982195"; geo:lat "48.953889" ]
    [ geo:long "8.43560245"; geo:lat "49.091529" ]
    [ geo:long "8.33996995"; geo:lat "49.08015" ]
) .
Different Approaches: Single Literal All coordinates are concatenated into a single literal value. Advantages: • Reduces the number of triples. • Allows the use of coordinate systems other than WGS 84. Disadvantages: • Does not allow adding metadata to single parts of the geometry (at the level of the coordinates). • Does not allow referencing shared segments.
Example: Single Literal
@prefix geo: <http://www.w3.org/2003/01/geo/wgs84_pos#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix nuts: <http://nuts.geovocab.org/id/> .
@prefix ngeo: <http://geovocab.org/geometry#> .
nuts:DE123_geometry rdf:type ngeo:Polygon .
nuts:DE123_geometry ngeo:exterior _:d1e59878 .
_:d1e59878 rdf:type ngeo:LinearRing .
_:d1e59878 ngeo:posList "8.33996995 49.08015,8.41577995 49.2510995,8.46698545 49.2829755,8.48726795 49.2900265,8.81823295 49.194497,8.87779445 49.0584785,8.57685695 48.9896935,8.49357245 48.820182,8.41662495 48.835368,8.30566745 48.862568,8.35457445 48.934889,8.26128395 48.980917,8.27714095 48.99016,8.539821950 48.953889,8.43560245 49.091529,8.33996995 49.08015" .
Different Approaches: List of coordinate literals Mixes both previous approaches by encoding the coordinates as a list of literals, each of which encodes a segment of coordinates. Advantages: • Allows the user to choose the desired level of granularity. • Makes it possible to group contiguous parts of a geometry that have the same metadata. • Makes it easy to reuse shared borders. • Allows the use of coordinate systems other than WGS 84. Disadvantages: • Segments must be joined for querying with current libraries.
Example: List of coordinate literals
@prefix geo: <http://www.w3.org/2003/01/geo/wgs84_pos#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix nuts: <http://nuts.geovocab.org/id/> .
@prefix ngeo: <http://geovocab.org/geometry#> .
nuts:DE123_geometry rdf:type ngeo:Polygon .
nuts:DE123_geometry ngeo:exterior _:d1e59878 .
_:d1e59878 rdf:type ngeo:LinearRing .
_:d1e59878 ngeo:posList (
    "8.33996995 49.08015,8.41577995 49.2510995,8.46698545 49.2829755,8.48726795 49.2900265,8.81823295 49.194497"
    "8.87779445 49.0584785,8.57685695 48.9896935,8.49357245 48.820182,8.41662495 48.835368,8.30566745 48.862568"
    "8.35457445 48.934889,8.26128395 48.980917,8.27714095 48.99016,8.539821950 48.953889,8.43560245 49.091529,8.33996995 49.08015"
) .
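Joining the segments of such a list into one ring, as current libraries require, could look roughly like the sketch below. It assumes the "lon lat,lon lat,..." literal syntax shown in the examples and that the segments have already been read from the RDF list in order; the function names `parse_segment` and `ring_to_wkt` are hypothetical, and the short ring in the usage line is made-up illustration data.

```python
def parse_segment(literal):
    """Split a 'lon lat,lon lat,...' literal into (lon, lat) string pairs."""
    positions = []
    for pair in literal.split(","):
        lon, lat = pair.split()
        float(lon), float(lat)  # raises ValueError on non-numeric input
        positions.append((lon, lat))
    return positions


def ring_to_wkt(segments):
    """Concatenate ordered segments of one exterior ring into WKT."""
    coords = [pos for seg in segments for pos in parse_segment(seg)]
    first = tuple(float(v) for v in coords[0])
    last = tuple(float(v) for v in coords[-1])
    if first != last:
        raise ValueError("ring is not closed")
    body = ", ".join(f"{lon} {lat}" for lon, lat in coords)
    return f"POLYGON(({body}))"


# Illustrative ring split into two segments (not taken from NUTS or GADM).
print(ring_to_wkt(["0 0,1 0", "1 1,0 0"]))
```

The coordinate strings are passed through unchanged so that no precision is lost between the RDF literal and the WKT output.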
Geospatial Datasets GADM-RDF – http://gadm.geovocab.org • RDF representation of the administrative regions of the GADM project: http://gadm.org NUTS-RDF – http://nuts.geovocab.org • RDF representation of Eurostat's NUTS nomenclature. They serve as: • New geospatial information on the Semantic Web. • Bridges between already published spatial datasets. • Proof-of-concept platforms.
Vocabulary Mappings SC: rdfs:subClassOf, SP: rdfs:subPropertyOf, SA: owl:sameAs
Instance Mappings
Geometric Equivalences Geometric shapes will not be vertex by vertex equivalent. A sensible criterion for finding geometric equivalences is needed. NUTS-RDF and GADM-RDF have different: • Sampling values • Scales • Starting points • Rounding effects
Algorithm Overview (diagram labels: WGS-84, Plate Carrée projection; Hausdorff distance; spatial:EQ)
1. Retrieve sample data The algorithm requires: • WGS-84 coordinate reference system. • Plate Carrée projection: X = longitude, Y = latitude. Coordinates are treated as Cartesian. The projection distorts all parameters (area, shape, distance, direction). • Geometric shapes are equally distorted in both datasets. Local reprojections (e.g. UTM) are avoided. Units will be presented in centesimal degrees.
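A minimal sketch of what treating coordinates as Cartesian means in practice. The function names are hypothetical, and the scale factor assumes a spherical Earth purely to illustrate the distortion the slide mentions.

```python
import math


def plate_carree(lon_deg, lat_deg):
    """Plate Carrée: longitude becomes X and latitude becomes Y, unchanged."""
    return (lon_deg, lat_deg)


def planar_distance(p, q):
    """Euclidean distance between two projected points, in degrees."""
    return math.hypot(q[0] - p[0], q[1] - p[1])


def east_west_scale(lat_deg):
    """True east-west ground length relative to what the projection implies.

    Assumes a spherical Earth: one degree of longitude shrinks by cos(latitude),
    while the projection draws it the same size everywhere.
    """
    return math.cos(math.radians(lat_deg))


# First two vertices of the nuts:DE123 ring from the examples above.
p = plate_carree(8.33996995, 49.08015)
q = plate_carree(8.41577995, 49.2510995)
print(planar_distance(p, q), east_west_scale(49.08015))
```

Because both datasets are projected the same way, the distortion affects the two geometries being compared equally, which is why the comparison can proceed without a local reprojection.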
2. Similarity threshold function The Hausdorff Distance provides a measure of similarity between geometric shapes. It can be intuitively defined as the greatest distance from any point of one shape to the closest point of the other shape.
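The definition can be sketched over vertex lists as follows. This is a discrete approximation that only considers the vertices of each shape, not every point on its edges, and the function names are hypothetical; it is not necessarily the computation used in the presented system.

```python
import math


def directed_hausdorff(a, b):
    """Largest distance from a vertex of a to its nearest vertex of b."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a, b):
    """Symmetric discrete Hausdorff distance between two vertex lists."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


# Illustrative shapes: b has one extra vertex that is far from every vertex of a.
a = [(0.0, 0.0), (1.0, 0.0)]
b = [(0.0, 0.0), (1.0, 0.0), (1.0, 2.0)]
print(hausdorff(a, b))
```

Taking the maximum of both directions matters: every vertex of `a` coincides with a vertex of `b`, so the directed distance from `a` to `b` is zero, while the reverse direction exposes the outlying vertex.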
2. Similarity threshold function Smaller regions need a lower Hausdorff Distance threshold than larger regions.
2. Similarity threshold function We calculate the midpoint value between the Hausdorff Distances for a correct guess and the lowest wrong guess.
2. Similarity threshold function We perform regression on the midpoint values to obtain the Hausdorff Distance threshold function.
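The slides do not state the regression model or the independent variable. As a sketch only, the block below assumes an ordinary least-squares line fitted to midpoint values against a region size measure; the function names and the sample numbers are hypothetical and are not results from the evaluation.

```python
def midpoint(correct_dist, lowest_wrong_dist):
    """Midpoint between the Hausdorff distance of the correct match and
    that of the closest wrong candidate."""
    return (correct_dist + lowest_wrong_dist) / 2.0


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical sample: region size versus (correct, lowest wrong) distances.
sizes = [0.0, 1.0, 2.0]
mids = [midpoint(0.5, 1.5), midpoint(2.0, 4.0), midpoint(4.0, 6.0)]
slope, intercept = fit_line(sizes, mids)


def threshold(size):
    """Hausdorff distance threshold predicted for a region of this size."""
    return slope * size + intercept


print(threshold(1.5))
```

A candidate pair would then count as equivalent when its Hausdorff distance falls below `threshold(size)`, which gives smaller regions a lower threshold whenever the fitted slope is positive.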
Poor Geospatial Information Sometimes a location is approximated as a single point. This can lead to false assertions when calculating containment relations. <http://dbpedia.org/resource/Germany> geo:lat 52.516666; geo:long 13.383333 . <http://nuts.geovocab.org/id/DE30_geometry> rdf:type ngeo:Polygon . The point falls inside the polygon, but Germany is not contained in Berlin. Other properties must be considered to calculate containment relations (e.g. rdf:type). Other spatial relations (e.g. spatial:EQ) cannot be calculated.
Optimizations The cost of calculating the Hausdorff distance depends on the number of vertices. The Ramer-Douglas-Peucker algorithm simplifies geometric shapes, using an arbitrary maximum separation.
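A compact sketch of Ramer-Douglas-Peucker simplification follows. One deliberate deviation is an assumption of this sketch: it measures distance to the chord segment rather than to the infinite line, so that a closed ring, whose first and last vertices coincide, does not cause a division by zero. The function names and sample polyline are hypothetical.

```python
import math


def point_segment_distance(p, a, b):
    """Distance from point p to the segment from a to b."""
    ax, ay = a
    bx, by = b
    px, py = p
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)  # degenerate segment
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))  # clamp the projection onto the segment
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def rdp(points, epsilon):
    """Drop vertices that lie within epsilon of the simplified line."""
    if len(points) < 3:
        return list(points)
    index, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = point_segment_distance(points[i], points[0], points[-1])
        if d > dmax:
            index, dmax = i, d
    if dmax > epsilon:
        # Keep the farthest vertex and simplify both halves recursively.
        left = rdp(points[: index + 1], epsilon)
        right = rdp(points[index:], epsilon)
        return left[:-1] + right
    return [points[0], points[-1]]


line = [(0, 0), (1, 0.1), (2, 0), (3, 5), (4, 0)]
print(rdp(line, 0.5))
```

The `epsilon` parameter plays the role of the maximum separation mentioned above: larger values remove more vertices and make the Hausdorff computation cheaper, at the cost of a coarser shape.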
Spatial Databases The algorithm also works well with spatial databases (e.g. PostgreSQL / PostGIS):
SELECT g.gadm_id, n.nuts_id
FROM nuts n
INNER JOIN gadm g ON (n.geometry && g.geometry)
WHERE n.shape_area BETWEEN (g.shape_area * 0.9) AND (g.shape_area * 1.1)
  AND ST_HausdorffDistance(
        ST_SimplifyPreserveTopology(n.geometry, 0.5),
        ST_SimplifyPreserveTopology(g.geometry, 0.5)
      ) < g.max_hausdorff_dist;
Evaluation Not every NUTS region matches a GADM region. • Many NUTS regions represent parts or aggregations of GADM administrative boundaries. • Example: GADM 2_13988 (Leicestershire) and NUTS UKF2 (Leicestershire, Rutland and Northamptonshire). 1,671 NUTS regions => 965 matches & 13 false positives.
Currently available resources • NeoGeo vocabulary and best practices for publishing geodata as Linked Data • NUTS and GADM datasets online • Integration vocabulary online, including mappings • GADM mappings to DBpedia • Linked Data Services for accessing/querying spatial indices (withinRegion, boundingBox) • Work on similarity metrics (with optimisations and evaluation) for geospatial regions
Future Work • Finalisation of NeoGeo vocabulary • Improvement of precision of spatial similarity; publish service online • More earth and space science data • Tools to support the mapping process • More instance mappings to GADM • Possibly map to sensor descriptions • More experiments: querying of integrated data • Include reasoning • Temporal context
Conclusion • GeoVocab.org has published a vocabulary and vocabulary mappings • NUTS and GADM use the vocabulary and are instance-mapped to several other well-known datasets • Several services online • Using an optimised algorithm for the detection of spatially co-located features across multiple RDF datasets • More work to be done, including coordination with other efforts