Two Paradigms for Official Statistics Production

Boris Lorenc, Jakob Engdahl and Klas Blomqvist
Statistics Sweden

Preliminaries
• The talk concerns data and knowledge about the external world, not data and knowledge about producing statistics (though it might have consequences for the latter)
• Inspired by the various discussions on ongoing developments and initiatives within (official) statistics
• May have some relevance for editing
• Naturally, the views presented herein are those of the authors and do not necessarily reflect the policies of Statistics Sweden

Preliminaries (cont’d)
• Transition from (many) stovepipes to (few) integrated system(s)
• Among the intended goals
  • better integration of administrative data and survey data
  • better/faster response to new or changing user needs
• What an integrated system should look like to satisfy these requirements
  • answer sought in the field of knowledge systems/cognitive systems

Agenda
• Preliminaries
• On some distinctions and results regarding knowledge/cognitive systems
• Consequences for representing data in integrated systems for statistics production
• Further considerations for statistics methodology, including some thoughts regarding editing

Knowledge/Cognitive Systems
• Computational
  • symbolic
    • first-order predicate logic
    • other formal logic
    • etc.
  • subsymbolic
    • artificial neural networks (ANNs)
    • etc.
• Other (noncomputational)
  • embodied cognition
  • situated cognition
  • socially distributed cognition
  • etc.
Good for restricted domains with clear rules (e.g. chess), less good for open-world problems

Database developments
• Relational Model
  • RDBMS (Relational Database Management System)
  • implements first-order predicate logic
  • database schema: theory in predicate calculus
• NoSQL
  • schema-less (theory-less)
  • examples
    • Google's BigTable
    • solutions underlying some functions on Amazon, Twitter, and Facebook
• Perhaps related: Semantic Web
  • how to structure documents into a “web of data”
  • “a web of data that can be processed directly and indirectly by machines”
  • uses Resource Description Framework (rather than RDBMS)
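
The contrast between a schema acting as a "theory" and a schema-less store can be sketched in a few lines. This is only an illustration under assumptions: the `enterprise` table, its columns, the `ent:1` identifier, and the helper names are all hypothetical, and the triple store is a plain Python set in the spirit of RDF's subject-predicate-object form, not an actual RDF or NoSQL implementation.

```python
import sqlite3

# Relational side: the schema is declared up front and plays the role of a
# "theory"; rows that contradict it are rejected by the database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE enterprise ("
    " id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " employees INTEGER NOT NULL CHECK (employees >= 0))"
)
conn.execute("INSERT INTO enterprise VALUES (1, 'Alfa AB', 12)")


def try_insert(row):
    """Return True if the row satisfies the declared schema, else False."""
    try:
        conn.execute("INSERT INTO enterprise VALUES (?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


# A negative employee count violates the CHECK constraint.
rejected = not try_insert((2, "Beta AB", -5))

# Schema-less side: facts are (subject, predicate, object) triples.
# No schema is declared, so nothing is checked and nothing needs migrating.
triples = set()


def add_fact(subject, predicate, obj):
    """Store one fact; any predicate is accepted."""
    triples.add((subject, predicate, obj))


def describe(subject):
    """Collect everything currently known about one subject."""
    return {p: o for s, p, o in triples if s == subject}


add_fact("ent:1", "name", "Alfa AB")
add_fact("ent:1", "employees", 12)
# A new attribute arrives from another source: no schema change is required.
add_fact("ent:1", "sourceRegister", "tax")

print(rejected)
print(describe("ent:1"))
```

The trade-off is visible in the sketch: the relational table refuses the inconsistent row, whereas the triple store accepts any new kind of fact, leaving consistency checks to whatever process consumes the data.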

Consequences
• Paradigm I: Stovepipe + RDBMS
  • ‘manual’ management of a fairly restricted domain
  • single-purpose use
  • likely requires expert assistance to users in search and requirements specification
• Paradigm II: Integrated system + NoSQL
  • automatic building of world knowledge pertaining to the domain
  • multi-purpose use
  • likely empowers users to explore available data themselves and consider the merits of requiring new data

Sampling theory considerations
• In the context of Paradigm II:
  • use of weights
    • what should they then reflect:
      • inclusion probabilities (if known)?
      • nonresponse information (including an assumed model)?
      • auxiliary information pertaining to specific variables to be estimated?
  • use of models
  • memorylessness vs. Bayesian statistics
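
To make the first two candidate meanings of a weight concrete, here is a minimal sketch. It assumes the standard Horvitz-Thompson form (weight = 1 / inclusion probability) and a simple response-propensity adjustment; the function names and all numbers are illustrative assumptions, and the units listed are taken to be respondents.

```python
def design_weights(inclusion_probs):
    """Design weights: the inverse of each unit's inclusion probability."""
    return [1.0 / p for p in inclusion_probs]


def nonresponse_adjusted(weights, response_probs):
    """Divide each weight by an assumed (modelled) response propensity."""
    return [w / r for w, r in zip(weights, response_probs)]


def weighted_total(values, weights):
    """Weighted estimate of a population total."""
    return sum(w * y for w, y in zip(weights, values))


# Illustrative data only: three responding units.
y = [10.0, 20.0, 30.0]        # observed values
pi = [0.5, 0.25, 0.25]        # inclusion probabilities, assumed known
rho = [1.0, 0.5, 0.5]         # response propensities, from an assumed model

d = design_weights(pi)
ht_total = weighted_total(y, d)

w = nonresponse_adjusted(d, rho)
adjusted_total = weighted_total(y, w)

print(d, ht_total)
print(w, adjusted_total)
```

The same stored observations give different totals depending on which information the weight carries, which is why the choice matters when weights must serve many purposes in an integrated system rather than a single survey.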

Editing
• Editing for a purpose vs. editing “without a purpose”
  • adherence to general specifications (‘concept validity’)
  • self-learning (unsupervised) tools from computer science/ANN
  • model congruence (especially building automatic models using methods from the KDD (Knowledge Discovery and Data Mining) field)
  • more?
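
The distinction between the two kinds of editing can be sketched as follows. Note the substitution: instead of a self-learning ANN tool, the general-purpose check below uses a simple robust outlier rule (the modified z-score based on the median absolute deviation), chosen only because it needs no subject-matter rule. The record fields, the threshold of 3.5, and the data are assumptions made for illustration.

```python
from statistics import median


def rule_edit(record):
    """Purpose-specific edit: a hand-written consistency rule for one survey."""
    return record["turnover"] >= 0 and record["employees"] >= 0


def mad_flags(values, threshold=3.5):
    """General-purpose edit: flag values far from the median.

    Uses the modified z-score 0.6745 * |x - median| / MAD; no knowledge of
    what the variable means is required.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # All deviations are (nearly) identical; nothing can be flagged.
        return [False] * len(values)
    return [abs(0.6745 * (v - med) / mad) > threshold for v in values]


# Illustrative records only.
records = [
    {"turnover": 100, "employees": 5},
    {"turnover": 102, "employees": 6},
    {"turnover": 98, "employees": 4},
    {"turnover": 101, "employees": 5},
    {"turnover": 99, "employees": 5},
    {"turnover": 1000, "employees": 5},
]

passes_rule = [rule_edit(r) for r in records]
suspicious = mad_flags([r["turnover"] for r in records])

print(passes_rule)
print(suspicious)
```

Every record passes the hand-written rule, yet the last one is flagged by the data-driven check, which illustrates how editing "without a purpose" can surface values that no survey-specific rule anticipated.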

Conclusions
• The distinction is likely not as clear-cut as presented here; however, the trend is discernible:
  • transition from “manual” to automatic processing
  • potentially increased need to use models
• In building representations of “world knowledge”, in addition to RDBMS, pay attention to developments in NoSQL, Big Data, and similar
• Perhaps strengthen work on
  • general-purpose data editing
  • automated data editing
  • model use
  • ... (as already advanced in several contributions to the workshop)

Thank you boris.lorenc@scb.se
