110 likes | 572 Views
Two Paradigms for Official Statistics Production. Boris Lorenc, Jakob Engdahl and Klas Blomqvist Statistics Sweden. Preliminaries. The talk concerns data and knowledge about external world – not data and knowledge about producing statistics (but might have consequences for the latter)
E N D
Two Paradigms for Official Statistics Production Boris Lorenc, Jakob Engdahl and Klas Blomqvist Statistics Sweden
Preliminaries • The talk concerns data and knowledge about external world – not data and knowledge about producing statistics (but might have consequences for the latter) • Inspired by the different discussions on ongoing developments and initiatives within (official) statistics • May have certain relevance for editing • Naturally, the views presented herein are those of the authors, not necessarily reflecting policies of Statistics Sweden
Preliminaries (cont’d) • Transition from (many) Stovepipes to (few) Integrated System(s) • Among intended goals • better integration of administrative data and survey data, • better/faster response to new or changing user needs • How an integrated system should look like so as to satisfy these requirements • answer sought in the field of knowledge systems/cognitive systems
Agenda • Preliminaries • On some distinctions and results regarding knowledge/cognitive systems • Consequences for representing data in Integrated systems for statistics production • Further considerations for statistics methodology, including some thoughts regarding editing
Knowledge/Cognitive Systems • Computational • symbolic • first-order predicate logic • other formal logic • etc • subsymbolic • artificial neural networks (ANNs) • etc • Other (noncomputational) • embodied cognition • situated cognition • socially distributed cognition • etc Good for restricted domains with clear rules (e.g. chess), less good for open-world problems
Database developments • Relational Model • RDBMS (Relational Database Management System) • implements first-order predicate logic • database schema: theory in predicate calculus • NoSQL • schema-less (theory-less) • examples • Google‘s BigTable • solutions underlying some functions on Amazon, Twitter, and Facebook • Perhaps related: Semantic Web • how to structure documents into a “web of data” • “a web of data that can be processed directly and indirectly by machines” • uses Resource Description Framework (rather than RDBMS)
Consequences • likely requires expert assistance to users in search and requirements specification • likely empowers users to themselves explore available data and consider merits of requiring new data • Paradigm I: Stovepipe + RDBMS • ‘manual’ management of a fairly restricted domain • single-purpose use likely requires expert assistance to users in search and requirements specification • Paradigm II: Integrated system + noSQL • automatic building of world knowledge pertaining to the domain • multi-purpose use likely empowers users to themselves explore available data and consider merits of requiring new data
Sampling theory considerations • In the context of Paradigm II: • use of weights • what should they then reflect: • inclusion probabilities (if known)? • nonresponse information (including an assumed model)? • auxiliary information pertaining to specific variables to be estimated? • use of models • memorylessness vs. Bayesian statistics
Editing • Editing for a purpose vs. editing “without a purpose” • adherence to general specifications (‘concept validity’) • self-learning (unsupervised) tools from computer science/ANN • model congruence (especially building automatic models using methods from the KDD (Knowledge Discovery and Data Mining) field • more?
Conclusions • The distinction likely not as clear-cut as presented here, however the trend discernible: • transition from “manual” to automatic processing • potential increased need to use models • In building representations of “world knowledge”, in addition to RDBMS, pay attention to developments in NoSQL, Big Data, and similar • Perhaps strengthen work on • general-purpose data editing • automated data editing • model use • ... (as already advanced in several contributions to the workshop)
Thank you boris.lorenc@scb.se