Canada’s Updated Case Study and the Benefits and Challenges of Implementing the Generic Statistical Information Model. Flavio Rizzolo. Joint work with Tim Dunstan and Kathryn Stevenson. Work Session on Statistical Metadata, Geneva, 6-8 May 2013.
What we do
Data Service Centre(s), Statistical Production Services • Steady-state dataset management • Search / discover
All these services use metadata in one form or another
Collection Services, Business Processing Services, Generalized Systems Services, Address Register Services, Classification Services, Dissemination Services, Metadata Management
• Survey Planning • Instrument Generation • ICOS Governance • Sample Management • Training • Collected Data Management • Response Collection • Workload Management • HR Management • Logistics • Case Management • Response Processing • Collection Systems Operation • Survey Pre-Production Testing • Survey Progress Monitoring • Pre-Collection Respondent Communication • Respondent Support • Internal Communication
• Tabulation • Confidentiality • Sampling • Edit & Imputation
• Common processing platform for all business / micro-economic surveys
• EAIP web services
• “New Dissemination Model” • Single Output Database • CLF compliance • OpenData portal • Syndication • Social Media • Web Services
• Classification Management • Concordance Management • Common services
• Statistical Metadata Management Strategy implementation • Metadata Portal • Stewardship • Model repository? • Metadata search
Challenge 1: to make metadata available to all of them in an efficient, effective and controlled way
Census Statistical Infrastructure Services • Census 2016 Platform
Macro-economic Analysis & Modeling, Business Register Services • To be captured
Challenge 2: to exchange metadata in a common format with minimum overhead
Social Processing Services • Common processing platform for all socio-economic, labour, health surveys
Statistics Canada • Statistique Canada
Metadata management building blocks (status legend: Completed, Underway, Planned) • Integrated Metadata Base (IMDB) • Integrated Business Surveys Project (IBSP) • Centralized metadata repository • Integration with IMDB • Metadata + processing environment • Common Tools Project • Centralized metadata repository • Integration with IMDB • Social Survey Metadata Environment (SSME) • Social Survey Processing Environment (SSPE) • Data Service Centres (DCS) • Integrated Service Oriented Architecture (SOA)
GSBPM and GSIM • The Generic Statistical Business Process Model (GSBPM) has been a StatCan reference model since 2010 and is also being used to harmonize StatCan’s statistical processing infrastructure • The Generic Statistical Information Model (GSIM) is being adopted to specify, design, and implement components for integration into “plug’n’play” architectures and to link to standard formats (e.g. DDI, SDMX) • GSIM’s Concepts and Structures Groups will be the main classifiers of metadata and will function as inputs/outputs of GSBPM statistical business sub-processes
Data models for inputs and outputs • Information has to be consistent across all relevant business units • However, the same abstract information object (e.g., survey, questionnaire, classification) can be physically implemented by different data producers (and consumers) in different ways • This “impedance mismatch” between producers’ and consumers’ views (and understanding) of data can be addressed in one of two ways: • by forcing them to conform to each other’s data models (point-to-point data integration), or • by creating canonical information models to which producers’ and consumers’ models will map (SOA data integration)
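The difference between the two integration styles can be illustrated by counting schema mappings. The sketch below is a back-of-the-envelope illustration, not a figure from the presentation: it assumes the worst case in which every one of n systems exchanges data with every other system in both directions, with one mapping per direction. The function names are hypothetical.

```python
def point_to_point_mappings(n: int) -> int:
    """Directed mappings needed when each of n systems maps to every other system."""
    return n * (n - 1)


def canonical_mappings(n: int) -> int:
    """Mappings needed when each system maps to and from a single canonical model."""
    return 2 * n


for n in (3, 12, 36):
    # Prints: 3 6 6, then 12 132 24, then 36 1260 72
    print(n, point_to_point_mappings(n), canonical_mappings(n))
```

Under these assumptions the two approaches cost the same for three systems, but the point-to-point count grows quadratically while the canonical count grows linearly.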
Canonical models • Canonical information models are enterprise- or segment-wide, common representations of information objects – a “lingua franca” for data exchange • Within a SOA framework, they are implemented as object models that are serialized into XML Schema Definition (XSD) types • XSD types of canonical models will be maintained in a repository that can be referenced and reused by multiple service contracts (WSDL) • XSD types will be maintained by the service developers within a governance framework • Producer and consumer schemas need to be mapped to the canonical metadata models via schema mappings: object-relational mappings (ORMs) or object-XML mappings (OXMs)
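A minimal sketch of the idea, assuming a toy canonical survey object: a producer's record is mapped onto the canonical object model through a declarative schema mapping, and the canonical object is then serialized to XML. The class, the field names (SRVY_CD, SRVY_NM, REF_PER) and the XML element names are all hypothetical and do not come from StatCan's actual models or XSD types.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET


@dataclass
class CanonicalSurvey:
    """Hypothetical canonical representation of a survey."""
    survey_id: str
    title: str
    reference_period: str


# Hypothetical schema mapping: producer column name -> canonical field name.
PRODUCER_TO_CANONICAL = {
    "SRVY_CD": "survey_id",
    "SRVY_NM": "title",
    "REF_PER": "reference_period",
}


def to_canonical(producer_row: dict) -> CanonicalSurvey:
    """Apply the schema mapping to one producer record."""
    fields = {canon: producer_row[src] for src, canon in PRODUCER_TO_CANONICAL.items()}
    return CanonicalSurvey(**fields)


def to_xml(survey: CanonicalSurvey) -> str:
    """Serialize the canonical object to an XML string."""
    root = ET.Element("Survey", attrib={"id": survey.survey_id})
    ET.SubElement(root, "Title").text = survey.title
    ET.SubElement(root, "ReferencePeriod").text = survey.reference_period
    return ET.tostring(root, encoding="unicode")


row = {"SRVY_CD": "S001", "SRVY_NM": "Example Survey", "REF_PER": "2013-Q1"}
xml_text = to_xml(to_canonical(row))
print(xml_text)
```

In a real SOA setting the XML structure would be governed by the shared XSD types rather than hand-built; the point here is only that each producer needs one mapping to the canonical model, not one per consumer.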
SOA (meta)data exchange model (diagram annotations) • Customized XSLTs for transforming XML structure for composite services • Automatic deserialization (may include customized XSLTs when consumer model is far from canonical) • Complex data transformations via custom SQL queries (possibly recursive) when source model is far from canonical • Applications can access consumer’s models directly • Automatic serialization and typing • Inventory of canonical metadata XSD types to be imported into WSDLs for reusability across services • Automatically generated when producer model is close to canonical
StatCan and GSIM synergy • Active groups • Plug & Play • Implementation • Mapping GSIM to DDI and SDMX • Information objects are being aligned with GSIM at the implementation level • A two-way convergence • GSIM to StatCan • Survey instrument, questionnaire and classification canonical models – Semantic work between Enterprise Architecture, SNA, IBSP and ICOS influenced by GSIM model • StatCan to GSIM • Separation of Flow Decision (Rule) and Flow Action (Control Transition) in GSIM version 1.0 – Participation in GSIM Production Group
Canonical questionnaire model and GSIM (diagram annotation) • These entities may not be relevant for all GSBPM phases
Canonical classification model: object level • Additional entities to handle registration and more flexible formats • The item hierarchy is a tree: every item may have zero or more children • Two items are in a parent-child relationship only if their respective levels are in a parent-child relationship as well • The level hierarchy is linear: every level has at most one child
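The three structural rules above can be sketched as a small object model. This is an illustrative sketch only: the class names, method names and the level and item codes are hypothetical and are not taken from the canonical classification model itself.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Level:
    name: str
    parent: Optional["Level"] = None
    child: Optional["Level"] = field(default=None, repr=False)

    def add_child(self, name: str) -> "Level":
        # The level hierarchy is linear: every level has at most one child.
        if self.child is not None:
            raise ValueError(f"level {self.name!r} already has a child level")
        self.child = Level(name, parent=self)
        return self.child


@dataclass(eq=False)
class Item:
    code: str
    level: Level
    children: List["Item"] = field(default_factory=list)

    def add_child(self, code: str, level: Level) -> "Item":
        # Items may be parent and child only if their levels are parent and child.
        if level.parent is not self.level:
            raise ValueError(
                f"level {level.name!r} is not the child of level {self.level.name!r}"
            )
        child = Item(code, level)
        self.children.append(child)  # tree: an item may have zero or more children
        return child


# Hypothetical three-level classification.
sector = Level("Sector")
subsector = sector.add_child("Subsector")
group = subsector.add_child("Group")

root = Item("A", sector)
a1 = root.add_child("A1", subsector)
a2 = root.add_child("A2", subsector)
a1_1 = a1.add_child("A1.1", group)
```

Attaching a "Group" item directly under a "Sector" item, or giving "Sector" a second child level, raises a ValueError in this sketch.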
IMDB DDI service proof-of-concept • Expose the IMDB repository using a Service Oriented Architecture (SOA) approach instead of point-to-point • Provide IMDB metadata content in a standard format (DDI v3.x) • Support applications that focus on different types of metadata (e.g., surveys, variables, classifications, concepts) • Support the Data Liberation Initiative (DLI) and the Canadian Research Data Centre Network (CRDCN) Metadata projects
IMDB DDI service architecture (diagram annotations) • XSLT transforms: from DDI to HTML, CSV and other internal data formats (key for interoperability and SOA) • Proof-of-concept clients developed internally: JSP/Servlet, web client, standalone Java client • Mapping between the IMDB physical model and the DDI XML schemas, implemented with SQL queries • Potential clients: .NET, SAS, Excel, Reports, DW Integration Services
(Near) future work • How to deal with change management in GSIM (not trivial once it has been implemented) • What is the best possible implementation of GSIM for SOA data exchange? • Need to handle the complexity of data exchange across dozens of statistical production and infrastructure systems • Canonical models need to be simple and intuitive (easy to use by clients), and create little overhead • Need to consider light-weight alternatives to XML (e.g., JSON) • Does a single implementation “fit” the entire GSBPM? • Need to look at GSBPM and identify what level of detail is necessary for each information object within each process/sub-process • Example: a canonical questionnaire model should include edits, flows and cells at the design and collect phases, but not at the process or disseminate phases. Similarly, a data quality metric may not be necessary before the process phase • Will GSIM “level 2” take phases into account?
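The questionnaire example above can be sketched as a phase-dependent view over one canonical object, serialized as JSON (one of the light-weight alternatives mentioned). This is only one possible approach, and everything in it is an assumption for illustration: the field names, the sample content and the function name are hypothetical.

```python
import json

# Fields that, per the example, are only needed at the design and collect phases.
DETAIL_FIELDS = ("edits", "flows", "cells")
PHASES_WITH_DETAIL = {"design", "collect"}


def view_for_phase(questionnaire: dict, phase: str) -> dict:
    """Return the subset of the canonical questionnaire relevant to a GSBPM phase."""
    if phase in PHASES_WITH_DETAIL:
        return dict(questionnaire)
    return {k: v for k, v in questionnaire.items() if k not in DETAIL_FIELDS}


# Hypothetical canonical questionnaire.
questionnaire = {
    "id": "Q1",
    "title": "Example questionnaire",
    "questions": ["q1", "q2"],
    "edits": ["q1 must be non-negative"],
    "flows": [{"if": "q1 == 0", "goto": "end"}],
    "cells": ["q1.value"],
}

collect_json = json.dumps(view_for_phase(questionnaire, "collect"))
process_json = json.dumps(view_for_phase(questionnaire, "process"))
print(process_json)
```

A single stored model with per-phase projections keeps one source of truth while limiting the payload each client has to handle; whether GSIM itself should encode phases is the open question raised above.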
Farther ahead… • Metadata: investigate large-scale entity resolution – entity identifiers should not be multiplied beyond necessity • Every DB has a different ID for the same information object • Do we keep a centralized mapping between them? • Do we keep a centralized DNS-like system that assigns IDs to entities? (OKKAM project approach) • Architecture: explore alternative paradigms, e.g., event-driven architecture (EDA), to complement SOA • Subscription-based rather than request-based (e.g., RSS, Atom, etc.) • Loose coupling and scalability • SOA service composition vs. EDA syndication/aggregation • EDA subscribers need to be more sophisticated than SOA clients (e.g., need to be ready to store/handle event responses whenever they happen)
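The "centralized mapping" option for entity identifiers can be sketched as a small registry that links each database's local ID to one shared global ID. This is a toy illustration under stated assumptions, not the OKKAM design or any StatCan system: the class, the ID format and the database names are hypothetical.

```python
import itertools
from typing import Dict, Optional, Tuple


class IdentifierRegistry:
    """Toy central registry mapping (database, local ID) pairs to global IDs."""

    def __init__(self) -> None:
        self._global_ids: Dict[Tuple[str, str], str] = {}
        self._counter = itertools.count(1)

    def register(self, database: str, local_id: str,
                 global_id: Optional[str] = None) -> str:
        """Link a local ID to an existing global ID, or mint a new one."""
        key = (database, local_id)
        if key in self._global_ids:
            return self._global_ids[key]
        if global_id is None:
            global_id = f"entity-{next(self._counter)}"
        self._global_ids[key] = global_id
        return global_id

    def resolve(self, database: str, local_id: str) -> str:
        """Return the global ID for a local ID (KeyError if unregistered)."""
        return self._global_ids[(database, local_id)]

    def same_entity(self, a: Tuple[str, str], b: Tuple[str, str]) -> bool:
        """True if two local IDs refer to the same information object."""
        return self.resolve(*a) == self.resolve(*b)


registry = IdentifierRegistry()
gid = registry.register("db_a", "17")                   # new entity
registry.register("db_b", "SRV-0042", global_id=gid)    # same object, other DB
registry.register("db_b", "SRV-0099")                   # a different entity
print(registry.same_entity(("db_a", "17"), ("db_b", "SRV-0042")))
```

The hard part, deciding that two local records describe the same object, is the entity resolution problem itself and is outside this sketch; the registry only records the outcome.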