440 likes | 653 Views
New approaches to data modelling. Prepared and presented by Francesco Rizzo (ISTAT, Italy) and David Barraclough (OECD). Content. Why and how SDMX Disseminating multi-dimensional statistical tables Statistical datasets and data cube model Metadata in SDMX SDMX Information Model
E N D
New approaches to data modelling Prepared and presented by Francesco Rizzo (ISTAT, Italy) and David Barraclough (OECD)
Content • Why and how SDMX • Disseminating multi-dimensional statistical tables • Statistical datasets and data cube model • Metadata in SDMX • SDMX Information Model • SDMX main artefacts • SDMX Data Modelling • Dissemination capabilities and business constraints • Data modelling principles • SDMX data modelling main steps • Modelling Unit information
Why and how SDMX • Modernising the Dissemination business process: • Using statistical standards: GSBPM, SDMX • SDMX improves quality and efficiencies in the exchange and dissemination of data and metadata: • harmonization and coherence of data (semantic interoperability); • preservation of meaning; • open formats (XML, JSON, CSV) rather than a proprietary • SDMX IM for Linked Open Data vocabulary (RDF Data Cube). More in general Open (statistical) Data • Software free and open source easily to integrate (component-based) within a Statistical Information System
Why and how SDMX • SDMX use cases: • Reporting (to the International Organisations) • “push” - reporter pushes data to collector • “pull” - collector pulls data from reporter • Metadata management • Data warehousing • SDMX as a model for the structure of a data warehouse or metadata repository • for extraction, transformation, and load of data • Dissemination (GUI browsing and machine2machine) • to drive website presentation of data and metadata • as a queryable data source • for standardized file downloads • pull method is most useful, can cater for different queries • Validate data using XML schemes and VTL
Statistical datasets and data cube model • A statistical data set comprises a collection of observations made at some points across some logical space, • The collection can be characterized by a set of dimensions that define what the observation applies to (e.g. time, area, gender) along with metadata describing what has been measured (e.g. economic activity, population), how it was measured and how the observations are expressed (e.g. units, multipliers, status) • We can think of the statistical data set as a multi-dimensional space, or hyper-cube, indexed by those dimensions (This space is commonly referred to as a cube for short) • A cube is organized according to a set of dimensions, attributes and measures • Data can be “sliced” by performing a query on the dimension items in the cube.
Example of statistical data cube Tourism activity Time Country B020 ES 2004
Time series representation ES FR IT AT Country
Data Reference Metadata grouped into grouped into Metadata set Data set described by described by Structural Metadata DSD MSD Metadata in SDMX • Structural metadata: • Metadata act as identifiers and descriptors of the data (e.g. variables, code-lists, dataset, data structure definitions, etc.) • Reference metadata: • Conceptual metadata (e.g. definition of income) • Methodological and processing metadata (e.g. description of data processing) • Quality metadata (e.g. Availability)
High level schematic of the SDMX Information Model Categorisation
SDMX main artefacts (1/2) • Concept Scheme a list of Concepts, for a specific statistical domain, or for other purposes such as cross-domain Concepts • a Concept provides the semantics (and possibly also default representations) for Dimensions (Age, Reference area, Sector, Time), Data Attributes (Observation status, Unit multiplier) and Metadata Attributes • Code List alist of Codes from which some statistical Concepts (coded Concepts) take their values. A Statistical classification can be represented as a Code List • Data Structure Definition a DSD describes the structure of an organised collection of data (using Concepts defined in Concept Schemes and Codelists) identifying: • which Concepts are used as Dimensions and their representation (coded or free text) • which Concepts are used as Attributes and their representation (coded or free text), attachment level (Dataset, Dimensions group, Observation), obligatoriness • which Concepts are used as Measures (at least one measure must be identified)
SDMX main artefacts (2/2) • Data Flow filtered (constrained) sub-cube defined on a DSD. It is used to present simplified perspectives/views of the data to users • Constraint limits Dimensions and Attributes in a Dataflow to a set of Codes • Category Scheme container for Categories. Categories are used to classify any other structure in SDMX (e.g. Data Flows). Very often a Category Scheme in dissemination database GUI is identified by a Theme(Topic)-Tree • Categorisation a link between a Category in a Category Scheme to any other Identifiable structure in SDMX (e.g. a category with a Data Flow) • Metadata Structure Definition a MSD is used to define a template for reporting and disseminating Reference Metadata • Metadataflow it is like a Dataflow but for reference metadata
Data Structure Definition, Dataflow, Dataset DSD Dataflow Dataset Constraints on dimensions describes Constraints on dimensions describes
Skills of the Data Modeler • Business analysis skills to discuss and interpret the needs of users • Multidimensional modelling to understand statistical hypercubes, need to slice data • Metadata management to store, retrieve, version structural metadata • Dissemination/reporting experience to understand dataflows • (advantageous) Knowledge of the statistical domain
Important resources for data modelling • Modelling Statistical Domains in SDMX • Guidelines for SDMX Data Structure Definitions • Checklist for SDMX Design Projects Home
Step 1: Agree on the exchange needs (GSBPM: Specify Needs) • Data sources (e.g. Excel, Pdf, SAS/STAT/SPPS files, Legacy database, etc.) • Expected statistical visualisations (tables, graphs, full downloads, etc.) • Use case(s) of the data to be modelled, e.g. reporting, dissemination • Tables to disseminate • Available tools • Governance and maintenance constraints of the metadata and data • Any constraints on remodeling the data and metadata
Step 2: Plan the modeling project • Ensure the team has the knowledge • Define a roadmap • Identify the SDMX modelling tools and methodology to use • Method and tools used to collaborate with statisticians/modellers/IT • Issue log • Project plan with scope and above details
Step 3: SDMX data modelling main steps (GSBPM: Design) • Describe the Data Flows in the exchange. This is an examplefrom Global Labour Statistics
Step 3: SDMX data modelling main steps (GSBPM: Design) • Create the Data flow model (previous slide) • Create a Concept Scheme that satisfies the exchange needs • Decide the roles of the concepts • For each Data Flow, specify which concepts will be used and the subset of allowed codes • Optimise the model, removing or merging concepts that are rarely used in the Data Flows • Create DSDs by assigning Data Flows that have similar dimensionality • Create the Code Lists for the Concepts • Categorise the Data Flows. Allows navigation in registries, browsers
Some data modelling principles • Reuseof existing common structures (The Global SDMX Registry is the primary location to search for “global” and “cross-domain” SDMX artefacts) • For visualization, minimisethe numbers of Concepts displayed • It is preferred to show at most 5 Dimensions in a table • Merging Concepts may be done for short breakdown codelists, but this may compromise filtering and simple processing of the information • In .Stat, hiding concepts can be achieved using Dataflows and Constraints without merging concepts . Single-fixed concepts are hidden • Optimise the statistical domain model, taking in consideration the requirements of the Domain Manager • the minimal number of Concepts used in a DSD, • the minimum data sparseness in Data Flows, and • the minimum number of DSDs required • Use the SDMX concept-oriented guidelines! (COG)! • Especially the Guidelines for the creation and Management of SDMX Code Lists
Step 4: Creation of SDMX artefacts (GSBPM: Build) • Create the SDMX artefacts from the previous step • Several tools automatically generate the SDMX-ML • Upload the artefacts to the structural metadata repository • In .Stat, use the Data Lifecycle Manager to store the artefacts in the .Stat datawarehouse • Create guidelines, especially if the project replaces a non-SDMX reporting framework or if it is new • Perform internal and external tests
Example of identifying which tables must be disseminated 1) Domain manager requirements 2) I.Stat Data modelling 3) I.Stat database
An example of finding statistical concepts that describe the tables
An example of statistical concept coding and their code lists
Issues with Indicators (or Variable, Subject, etc.) Attempts have been made to define what the difference might be between these • Often the name/code is not used uniformly (indicator, measure, variable, subject, transaction; IND, INDICATOR, MEASURE, VAR, VARIABLE, etc.) In many instances, the compound indicator is used to denote “that one dimension into which we crammed all the information that we could not allocate into one dimension”. Some ‘indicators’ • Adolescents (15-year-olds) who report regularly engaging in vigorous physical activity outside of school (%) • Children (11-, 13- and 15-year-olds) who are overweight or obese (%) • Average disposable household income, 0-17 year-olds, 2010 USD PPP A ‘measure’ • Out-of-work income maintenance and support Useful, orthogonal information is often difficult to separate, map, and compare While the concept names/IDs are different, it makes mapping and transcoding very difficult! For the modeler, difficult to know how to construct the compound name. No standard, deterministic way to do this
Modelling the “Unit” information The unit informationmay be: • a compound “Statistical indicator”, or • decomposed into Unit characteristic concepts Compound indicators can be improved by: • removing the concept synonyms, and • harmonising the indicator composition to make mapping more consistent Now we describe the OECD’s method on how to decompose unit information using a methodological process. Even for compound indicators, it is good practice to decompose the information into separate concepts anyway, then merge back • Allows to standardize the composition of the indicator
The Mountain example Object: Chimborazo Property: height Value: 6263 Unit of measure: meters CC BY-SA: David Torres Costales
OECD’s Unit Decomposition Method • 3 mandatory unit concepts for all data: Measure, Unit of Measure, Unit multiplier • Measure: The entity+property being measured, e.g. Chimborazo height • Unit of Measure: Standard for measurement of the same kind of quantity • Use standard units • Typical UoMs: • Persons, Years, Metres, Tonnes, etc. • <Currency> • <Unit> per <unit>, e.g. National currency per US Dollar • Percentage points. Used for differences in percentages • Percentage of <denominator> • Avoid usingPercentage, Index or Rate, Share, Ratio alone. E.g. percentage of what? • Percentage should always be Percentage of something • Convert rates, share, ratio to percentages. For standardisation but also to remove synonyms • Unit multiplier: Exponent in base 10 used for calculating the actual value. Apply it to the observation value • Additional concepts that further qualify Unit of Measure • Base period: When the intrinsic value changes over time. Used in conjunction with • Prices: Current or Constant price. Needs Base period • Seasonal adjustment: Used to remove seasonal and calendar influences from a series • Time transformation: Coding for time-related operations on series: growth rates, moving averages, etc: Guidelines on coding time transformations in SDMX
Decomposition Examples Proportion (%) of children (aged 0-14) that live in households where all adults are in employment (working) Replaced by: • Measure: Children • UoM: Percentage of children • Unit multiplier: Units • Labour force status provider: All in employment • Age: 0 to 14 years • Living arrangement: Total Proportion (%) of children (aged 0-17) living with two parents Replaced by: • Measure: Children • UoM: Percentage of children • Unit multiplier: Units • Age: 0 to 17 years • Living arrangement: 2 parents
Decomposition Examples Total public social expenditure on families as a % of GDP Replaced by: • Measure: Public social expenditure on families • UoM: Percentage of GDP • Unit multiplier: Units • Benefit type: Total Public social expenditure on cash benefits for families as a % of GDP Replaced by: • Measure: Public social expenditure on families • UoM: Percentage of GDP • Unit multiplier: Units • Benefit type: Cash benefits Public social expenditure on services and in-kind benefits for families as a % of GDP Replaced by: • Measure: Public social expenditure on families • UoM: Percentage of GDP • Unit multiplier: Units • Benefit type: Services and in-kind benefits
Demo of SDMX Matrix Generator Open source Excel-based tool used to model domains/frameworks and generate SDMX-ML Available on Github: https://github.com/OECDSTD/sdmx-matrix-generator