This project focuses on merging statistics and geospatial information to prepare for the implementation of linked open data (LOD) in official statistics. It involves identifying data sources, harmonizing statistical units, and transforming data into RDF. The pilot aims to provide recommendations for a full implementation. Specific objectives include analyzing data sources for openness and popularity. Administrative boundaries and statistical units are inventoried, and the statistical units are harmonized. The LOD pilot covers statistical and geospatial data, along with a data sources catalogue based on DCAT-AP. Data are transformed into RDF with Python scripts, and the results are stored in a triplestore served by Apache Jena Fuseki. The project identifies the lack of pan-European guidelines for statistical linked open data and finds Python-based implementations sustainable.
Publishing georeferenced statistical data using linked open data technologies Mirosław Migacz GIS Consultant Statistics Poland Merging statistics and geospatial information grant series NTTS 2019 Conference / Brussels / Belgium
The project • Title: "Development of guidelines for publishing statistical data as linked open data" • "Merging statistics and geospatial information" grant series • 2016 – 2017 • main goal: prepare a background for LOD implementation in official statistics
Before • 3218 • 4.4.32.64.18 • powiat łobeski (LAU 1) • lobeski • 4326418
After • powiat łobeski • http://nts.stat.gov.pl/4/4/32/64/18
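The before/after pair suggests that the dotted code `4.4.32.64.18` maps onto the path of the URI. A minimal sketch of that mapping follows, assuming every dot-separated segment simply becomes one path segment; the function name `nts_code_to_uri` is hypothetical, and only the base URI and the sample code come from the slides.

```python
BASE_URI = "http://nts.stat.gov.pl/"  # base taken from the "After" slide


def nts_code_to_uri(dotted_code: str, base: str = BASE_URI) -> str:
    """Turn a dotted code such as '4.4.32.64.18' into a slash-separated URI.

    Assumption: each dot-separated segment becomes one path segment.
    """
    segments = dotted_code.strip().split(".")
    if not all(segment.isdigit() for segment in segments):
        raise ValueError(f"unexpected code: {dotted_code!r}")
    return base + "/".join(segments)


lobeski_uri = nts_code_to_uri("4.4.32.64.18")
print(lobeski_uri)
```

A single URI of this kind replaces the several competing identifiers listed on the "Before" slide.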
Specific objectives • identify data sources • identify statistical units • harmonize, generalize and build URIs for statistical units • transform statistical data, geospatial data and metadata into RDF (pilot) • conclude the pilot transformation and formulate recommendations for a full-scale implementation
Identification of data sources • Other data sources: • publications • tables • communiques • announcements • articles
Data sources - inventory • Metadata: • thematic category, • format (PDF, DOC, XLS, CSV), • spatial reference (country, NUTS, LAU, functional areas, urban areas), • temporal reference (years), • presence of identifiers (TERYT, NTS, NUTS), • update cycle • Preliminary analysis of data sources: • openness • redundancy of information • popularity (based on view / download stats)
Statistical units inventory • administrative boundaries: • administrative units • NUTS • non-standard statistical units: • functional areas / urban areas • groups of administrative / statistical units • derived mostly from strategic documents
Statistical units harmonization – KTS • KTS – classification combining administrative and statistical units • introduced last year to comply with NUTS 2016 • 14-digit code
Geometry harmonization / generalization • Input data: • administrative boundaries since 2002 for LAU 2 (gmina), excluding 2007 • Harmonization process: • structure standardization • standardization of identifiers (creating KTS identifiers) • aggregation to higher level units (LAU 1 -> NUTS 1) • Generalization: • several generalization scenarios tested in order to choose an optimal one • datasets with generalized and non-generalized geometries prepared for 2002-2016
LOD pilot – statistical data • data: • demographic data for 2016 from three major databases (Local Data Bank, Demography Database, STRATEG system), • ontologies for classifications: • age code list defined using SKOS (skos) & Dublin Core (dct), • sex code list re-used from SDMX, with a Polish translation added, • defining metadata for statistical values (observations): • based primarily on SDMX ontologies (attribute, code, measure, dimension), • qb:Observation class from Data Cube.
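To make the observation model above concrete, here is a rough sketch of one `qb:Observation` written out as Turtle with plain string formatting. The `qb:` and `sdmx-*` namespaces are the published Data Cube and SDMX-RDF ones; the `ex:` namespace, the dataset name `ex:population-2016`, the age code `0-4` and the value `100` are placeholders invented for illustration, not pilot data. Using a plain `xsd:gYear` literal for the period is a simplification.

```python
PREFIXES = (
    "@prefix qb: <http://purl.org/linked-data/cube#> .\n"
    "@prefix sdmx-dimension: <http://purl.org/linked-data/sdmx/2009/dimension#> .\n"
    "@prefix sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#> .\n"
    "@prefix sdmx-code: <http://purl.org/linked-data/sdmx/2009/code#> .\n"
    "@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .\n"
    "@prefix ex: <http://example.org/stat/> .\n"  # hypothetical namespace
)


def observation_turtle(obs_id, area_uri, year, sex_code, age_code, value):
    """Return one qb:Observation as a Turtle snippet (without prefixes)."""
    return (
        f"ex:{obs_id} a qb:Observation ;\n"
        f"    qb:dataSet ex:population-2016 ;\n"  # hypothetical dataset name
        f"    sdmx-dimension:refArea <{area_uri}> ;\n"
        f'    sdmx-dimension:refPeriod "{year}"^^xsd:gYear ;\n'
        f"    sdmx-dimension:sex sdmx-code:sex-{sex_code} ;\n"  # SDMX code list: F, M, T
        f"    sdmx-dimension:age ex:age-{age_code} ;\n"  # hypothetical SKOS age concept
        f"    sdmx-measure:obsValue {int(value)} .\n"
    )


snippet = observation_turtle(
    obs_id="obs-1",
    area_uri="http://nts.stat.gov.pl/4/4/32/64/18",
    year=2016,
    sex_code="F",
    age_code="0-4",
    value=100,  # placeholder value, not a real statistic
)
print(PREFIXES + snippet)
```

Each dimension (area, period, sex, age) points at a shared code list entry, which is what lets observations from separate source databases be queried together.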
LOD pilot – geospatial data • input geometries: • voivodship geometries for 2016, • ontologies: • ontology for the KTS classification defined using RDF Schema (rdfs) & GeoSPARQL (geo) vocabularies, • geometry encoding: • separate geo:Geometry entities with geometry encoded in WKT (Well-Known Text) format (geo:wktLiteral).
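The geometry encoding described above can be sketched as follows: a unit links to a separate `geo:Geometry` entity whose shape is stored as a `geo:wktLiteral`. The GeoSPARQL terms (`geo:hasGeometry`, `geo:asWKT`) are standard; the URIs under `example.org` and the square ring are placeholders, not real voivodship boundaries.

```python
GEO_PREFIX = "@prefix geo: <http://www.opengis.net/ont/geosparql#> .\n"


def ring_to_wkt(ring):
    """Encode one outer ring of (x, y) pairs as a WKT POLYGON, closing it if needed."""
    ring = list(ring)
    if ring[0] != ring[-1]:
        ring.append(ring[0])
    coords = ", ".join(f"{x} {y}" for x, y in ring)
    return f"POLYGON (({coords}))"


def geometry_turtle(unit_uri, geometry_uri, ring):
    """Link a unit to a separate geo:Geometry entity holding the WKT literal."""
    wkt = ring_to_wkt(ring)
    return (
        f"<{unit_uri}> geo:hasGeometry <{geometry_uri}> .\n"
        f"<{geometry_uri}> a geo:Geometry ;\n"
        f'    geo:asWKT "{wkt}"^^geo:wktLiteral .\n'
    )


square = [(0, 0), (1, 0), (1, 1), (0, 1)]  # placeholder ring
turtle = GEO_PREFIX + geometry_turtle(
    "http://example.org/unit/1",      # hypothetical unit URI
    "http://example.org/geometry/1",  # hypothetical geometry URI
    square,
)
print(turtle)
```

Keeping the geometry as its own entity means the unit description stays small and the same unit can later point at several geometries (for example, generalized and non-generalized versions).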
LOD pilot – data sources catalogue • DCAT-AP (dcat) application profile for data portals in Europe, • data sources described as instances of the dcat:Dataset class, • links to other vocabularies: • EuroVoc (for thematic categories), • EU Publications Office continent / country code list (for spatial reference), • Internet Media Type (MIME)
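A catalogue entry along these lines might look like the sketch below, which writes one `dcat:Dataset` with a single CSV distribution. The DCAT and Dublin Core terms, the Publications Office country URI for Poland and the IANA media type URI are real; the dataset URI, access URL, description text and the EuroVoc concept URI are placeholders (the EuroVoc identifier in particular is not a real concept).

```python
PREFIXES = (
    "@prefix dcat: <http://www.w3.org/ns/dcat#> .\n"
    "@prefix dct: <http://purl.org/dc/terms/> .\n"
)
COUNTRY_PL = "http://publications.europa.eu/resource/authority/country/POL"
MEDIA_CSV = "https://www.iana.org/assignments/media-types/text/csv"


def dataset_turtle(dataset_uri, title, description, theme_uri, access_url):
    """Describe one data source as a dcat:Dataset with one CSV distribution.

    Assumption: title and description contain no double quotes or newlines.
    """
    distribution_uri = dataset_uri + "/csv"
    return (
        PREFIXES
        + f"<{dataset_uri}> a dcat:Dataset ;\n"
        + f'    dct:title "{title}"@en ;\n'
        + f'    dct:description "{description}"@en ;\n'
        + f"    dcat:theme <{theme_uri}> ;\n"      # EuroVoc concept
        + f"    dct:spatial <{COUNTRY_PL}> ;\n"     # country code list entry
        + f"    dcat:distribution <{distribution_uri}> .\n"
        + f"<{distribution_uri}> a dcat:Distribution ;\n"
        + f"    dcat:accessURL <{access_url}> ;\n"
        + f"    dcat:mediaType <{MEDIA_CSV}> .\n"
    )


entry = dataset_turtle(
    dataset_uri="http://example.org/catalogue/local-data-bank",  # hypothetical
    title="Local Data Bank",
    description="Placeholder description of the data source",
    theme_uri="http://eurovoc.europa.eu/0000",  # placeholder, not a real concept
    access_url="http://example.org/download/local-data-bank.csv",  # hypothetical
)
print(entry)
```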
LOD pilot – linking • dataset definitions for statistical data • spatial domain for datasets • geometries for observations
Data transformation into RDF 1. Source files in CSV
Data transformation into RDF 2. Python script using the RDFlib module for transformation
Data transformation into RDF 3a. Results in any desired format (RDF/XML)
Data transformation into RDF 3b. Results in any desired format (Turtle)
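The three steps above (CSV source, Python transformation, serialized output) can be sketched roughly as follows. The pilot used the RDFlib module; to stay dependency-free, this sketch instead writes N-Triples by hand with the standard library, so it illustrates the flow rather than the author's script. The column names (`unit_code`, `year`, `value`), the observation URIs and the values are invented placeholders.

```python
import csv
import io

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
QB = "http://purl.org/linked-data/cube#"
SDMX_DIM = "http://purl.org/linked-data/sdmx/2009/dimension#"
SDMX_MEASURE = "http://purl.org/linked-data/sdmx/2009/measure#"
XSD = "http://www.w3.org/2001/XMLSchema#"

# Placeholder CSV; column names and values are assumptions, not pilot data.
CSV_TEXT = "unit_code,year,value\n4.4.32.64.18,2016,100\n4.4.32.64.18,2015,90\n"


def rows_to_ntriples(csv_text):
    """Convert each CSV row into four N-Triples statements for one observation."""
    lines = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for number, row in enumerate(reader, start=1):
        obs = f"<http://example.org/stat/obs/{number}>"  # hypothetical URI scheme
        area = "<http://nts.stat.gov.pl/" + row["unit_code"].replace(".", "/") + ">"
        lines.append(f"{obs} <{RDF_TYPE}> <{QB}Observation> .")
        lines.append(f"{obs} <{SDMX_DIM}refArea> {area} .")
        lines.append(f'{obs} <{SDMX_DIM}refPeriod> "{row["year"]}"^^<{XSD}gYear> .')
        lines.append(f'{obs} <{SDMX_MEASURE}obsValue> "{row["value"]}"^^<{XSD}integer> .')
    return "\n".join(lines) + "\n"


ntriples = rows_to_ntriples(CSV_TEXT)
print(ntriples)
```

With RDFlib, the same rows would be added to a graph object and serialized to RDF/XML or Turtle on request, which is what makes "any desired format" cheap to produce.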
LOD pilot – triplestore • Apache Jena Fuseki used as a SPARQL server, • 71717 triples loaded, • single Fuseki dataset (STAT_LOD) to allow cross-querying and cross-browsing data created initially in separate files, • SPARQL endpoint for querying
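Querying such an endpoint over HTTP could look like the sketch below, which only builds the request and does not send it. The dataset name `STAT_LOD` comes from the slide; the host, the port `3030` (Fuseki's default) and the `/sparql` service path are assumptions about a local default setup, and the actual pilot endpoint address is not given in the source.

```python
from urllib.parse import urlencode
from urllib.request import Request

FUSEKI_BASE = "http://localhost:3030"  # assumed local server on Fuseki's default port
DATASET = "STAT_LOD"                   # dataset name from the slide

# Count all observations in the dataset.
QUERY = """\
PREFIX qb: <http://purl.org/linked-data/cube#>
SELECT (COUNT(?obs) AS ?n) WHERE { ?obs a qb:Observation }
"""


def build_sparql_request(query, base=FUSEKI_BASE, dataset=DATASET):
    """Build a SPARQL-protocol GET request asking for JSON results."""
    url = f"{base}/{dataset}/sparql?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request(QUERY)
print(request.full_url)
# To run it against a live server:
#   import json, urllib.request
#   with urllib.request.urlopen(request) as response:
#       results = json.load(response)
```

Because all files were loaded into one dataset, a single query like this can span statistical observations, geometries and catalogue entries.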
LOD pilot – conclusions • No reference implementation for statistical linked open data: • lack of integrity between RDF metadata sets published by one authority, • links to non-existing entities, • lack of maintenance, • Lack of pan-European guidelines for statistical linked open data: • common vocabularies, • recommended or dedicated software components, • DIGICOM ESSNet LOD project.
LOD pilot – conclusions • Some software / programming components are no longer being developed: • implementations might become unstable, • a Python-based implementation seems sustainable at this point, • Semantic harmonization of statistical classifications: • different meanings for supposedly the same classification elements, e.g. 0-5 can be "0 to 5" or "0 to less than 5", • not only a pan-European issue, may also exist at country level.
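The "0-5" ambiguity above becomes obvious once the upper bound is modelled explicitly. A small sketch, with a hypothetical `AgeRange` class that is not part of any vocabulary mentioned in the slides:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgeRange:
    """An age class with an explicit statement about its upper bound."""
    lower: int
    upper: int
    upper_inclusive: bool

    def whole_years(self):
        """List the completed years of age that fall into this class."""
        last = self.upper if self.upper_inclusive else self.upper - 1
        return list(range(self.lower, last + 1))


zero_to_five = AgeRange(0, 5, upper_inclusive=True)            # "0 to 5"
zero_to_under_five = AgeRange(0, 5, upper_inclusive=False)     # "0 to less than 5"

print(zero_to_five.whole_years())
print(zero_to_under_five.whole_years())
print(zero_to_five == zero_to_under_five)
```

Two code lists that both label a class "0-5" would therefore need distinct concepts, or an explicit mapping, before their observations can be compared.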
LOD pilot – conclusions • Methodology for publishing spatial data as linked open data: • single entity per single geometry: • inventory of boundary changes, • geometry instances with non-meaningful identifiers (UUIDs), • separate geometries for respective years: • a complete set of geometries each year, regardless of changes, • geometry instances with meaningful identifiers (KTS + year).
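The two identifier schemes above can be compared with a short sketch. It assumes boundaries are available as one WKT string per year; the 14-digit KTS code and the WKT strings are placeholders, and the `KTS_year` separator is an assumption about how "KTS + year" might be written.

```python
import uuid


def single_entity_per_geometry(wkt_by_year):
    """Scheme 1: one opaque UUID per distinct geometry, reused while unchanged."""
    id_by_wkt = {}
    id_by_year = {}
    for year, wkt in sorted(wkt_by_year.items()):
        if wkt not in id_by_wkt:
            id_by_wkt[wkt] = f"geom-{uuid.uuid4()}"
        id_by_year[year] = id_by_wkt[wkt]
    return id_by_year


def geometry_per_year(kts_code, wkt_by_year):
    """Scheme 2: a full set every year, identified by KTS code plus year."""
    return {year: f"{kts_code}_{year}" for year in sorted(wkt_by_year)}


KTS_CODE = "00000000000000"  # placeholder 14-digit code, not a real unit
boundaries = {
    2014: "POLYGON ((0 0, 1 0, 1 1, 0 0))",
    2015: "POLYGON ((0 0, 1 0, 1 1, 0 0))",  # unchanged since 2014
    2016: "POLYGON ((0 0, 2 0, 2 2, 0 0))",  # boundary change
}

opaque_ids = single_entity_per_geometry(boundaries)
yearly_ids = geometry_per_year(KTS_CODE, boundaries)
print(len(set(opaque_ids.values())), "geometry entities with UUIDs")
print(len(set(yearly_ids.values())), "geometry entities with KTS + year")
```

The first scheme stores fewer geometries but requires tracking which years changed; the second is redundant but lets a consumer derive the identifier directly from the unit code and the year.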
LOD pilot – conclusions • Most linked open data implementations are technically correct: • it is nearly impossible to produce incorrect RDF metadata files, • you can put anything in the RDF graph, but does it make sense semantically? • Linked open data implementations based on Python scripts are easy to amend in the future, • RDF vocabulary specifications are easier to interpret with a UML model provided (thank you, Captain Obvious)
Merging statistics and geospatial information grant series Mirosław Migacz GIS Consultant Statistics Poland Publishing georeferenced statistical data using linked open data technologies www.linkedin.com/in/migacz m.migacz@stat.gov.pl NTTS 2019 Conference / Brussels / Belgium