Quality of PSI Robbin te Velde Helsinki, 19-20 April 2007
Outline of the presentation • Short (philosophical) introduction on quality • Data management & data quality (practice and theory) • Conventional data management • PSI enlightened models • Quality and pricing
Defining the elusive concept of Quality (I) • Common definitions of quality (Garvin, 1984): • Transcendent: “quality is neither mind nor matter, but a third entity independent of the two […] even though Quality cannot be defined, you know what it is” (Pirsig, 1974) • Product-based: “differences in quality amount to differences in the quantity of some desired ingredient or attribute” (Abbott, 1955) • Manufacturing-based: “quality means conformance to requirements” (Crosby, 1984) • Value-based: “quality means best for certain customer conditions. These conditions are (a) the actual use and (b) the selling price of the product” (Feigenbaum, 1961) • User-based: “quality is fitness for use” (Juran, 1988)
Defining the elusive concept of Quality (II) • There is no unambiguous definition of quality. • Each definition stresses different dimensions of quality management; [thus] the specific interpretation of quality is not a neutral process but is both cause and effect of internal (management x staff) and external (organisation x customer; buyer x supplier) relations. • In each era one particular definition has been dominant. • Over the centuries there has been a shift from the transcendent to the product- and manufacturing-based, via the value-based, back to the more transcendent user-based definition.
The grim reality of data quality • Lack of Metadata Management • no common data definitions exist about what data means (e.g., a shared vocabulary) • No clarity on data ownership • users create, modify and access data, but nobody sees it as their responsibility to own it (fear of a ‘blame culture’) • Poor data quality • no common, consistent way of validating data across applications • Massive data redundancy and fractured, inconsistent data across different systems • significant data re-keying • maintenance of master data attributes done in different systems • two-way data flow between systems to synchronise the same data • Business process outsourcing occurring without process integration and/or integrated master data management • Fractured, unmanaged unstructured content • no CMS and/or taxonomy to organise the content
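To make the ‘common data definitions’ point concrete, a minimal sketch in Python (the field names, units and rules are invented for illustration, not taken from the presentation): one shared data definition that every application validates against, instead of each system re-keying its own rules.

```python
# Hypothetical shared vocabulary: field names, units and constraints agreed once.
COMPANY_RECORD_DEFINITION = {
    "company_id": {"type": str,   "required": True},  # one agreed identifier
    "waste_kg":   {"type": float, "required": True},  # unit fixed to kg, not tons
}

def validate(record: dict, definition: dict = COMPANY_RECORD_DEFINITION) -> list:
    """Return a list of violations; an empty list means the record conforms."""
    errors = []
    for field, rule in definition.items():
        if field not in record:
            if rule["required"]:
                errors.append(f"missing required field: {field}")
        elif not isinstance(record[field], rule["type"]):
            errors.append(f"{field}: expected {rule['type'].__name__}")
    return errors

# Every system (statistics office, ministry, portal) calls the same validate(),
# so 'valid' means the same thing everywhere and rules cannot drift apart.
print(validate({"company_id": "NL-123", "waste_kg": 950.0}))   # []
print(validate({"company_id": "NL-123", "waste_kg": "0.95"}))  # type violation
```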
Data Quality Improvement as part of Data Management (I) • Regulatory Compliance (SOX, etc.) • Data Quality Analysis (including Data Profiling) • Data Cleanup Campaigns and Programs • Data Quality Requirements Analysis • Data Quality Auditing and Certification
[Diagram: Data Quality Improvement shown as one function within the wider set of data management functions: Data Architecture, Analysis & Design; Unstructured Data Management; Database Administration; Data Stewardship, Strategy & Governance; Reference & Master Data Management; Data Security Management; Data Warehousing & Business Intelligence; Metadata Management]
Conventional Data Management (II) • This model is still very much within the manufacturing-based tradition of quality control • Quality is defined as the accuracy of the product (the data) • Assumes the existence of ex ante, objective, uniform quality criteria • Works well, but only under certain conditions (a stable, well-defined operational environment)
[Diagram: data from the primary process passes through a quality criteria filter; accepted records flow on to re-use, rejected records do not. Example I: Unique identification of companies at the Dutch Basic Business Register (BBR) [source: Human Inference] (primary process: hospital). Example II: Lack of common data definition between Czech Statistical Office (CZSO) and Ministry of Environment (MoE) [source: prof. Jiri Hřebiček, Masaryk University, Brno, Czech Republic]: the same value is recorded in kg by the CZSO and in tons by the MoE, and re-used in the EC database by investors.]
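A minimal sketch of this accept/reject filter (the criteria and record fields are illustrative, not from the presentation): quality is a fixed set of ex ante predicates, and a record either passes all of them and flows on to re-use, or is rejected.

```python
# Conventional model: ex ante, objective, uniform quality criteria applied as a
# filter between the primary process and re-use.
QUALITY_CRITERIA = [
    ("has identifier",  lambda r: bool(r.get("company_id"))),
    ("weight positive", lambda r: r.get("waste_kg", -1.0) > 0),
]

def quality_filter(records):
    """Split records into (accepted, rejected) using the fixed criteria."""
    accepted, rejected = [], []
    for record in records:
        failures = [name for name, check in QUALITY_CRITERIA if not check(record)]
        (rejected if failures else accepted).append(record)
    return accepted, rejected

primary_process_output = [
    {"company_id": "NL-123", "waste_kg": 950.0},
    {"company_id": "",       "waste_kg": 950.0},   # rejected: no identifier
]
accepted, rejected = quality_filter(primary_process_output)
# Only 'accepted' flows on to re-use; this works as long as the criteria really
# are complete and the operational environment stays stable.
```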
Limitations of conventional Data Management • Ex ante objective criteria are never 100% complete • It is impossible to define beforehand all possible combinations • If you do not include enough combinations, you miss the ‘fuzzy’ ones • If many combinations are included, filtering takes too long • The (futile) effort to go for 100% accuracy hampers process outsourcing • Reference data has different meanings to different people; the quality of this reference data is related to the requirements of each user
Example III: Address validation at RWTÜV AG (Germany) [source: Human Inference] • 350,000 objects x 3 types of address x 15 object categories • Search options were only based on exact matches, so all ‘fuzzy’ duplicates (e.g., alternatively spelled names) were not found • The number of combinations was already so big that the filtering took several seconds • This often resulted in duplicates: once users had searched for a few seconds without any results, they simply created a new record
Example IV: Content syndication at Dutch government portal (overheid.nl) • The official Dutch government portal Overheid.nl has a strict policy not to allow any content from third (private) parties on the website • This is not a particularly citizen-centered approach, but the official policy statement is that they only want “100% certified” information and thus do not accept content generated by processes that are not fully under their own control
Example V: Use of business rules for UK security trading (private sector) [source: Finsoft Ltd] • Reference data has a different meaning to different users, and the quality of this data is related to the requirements of each user • Some reference data may be more critical than others, depending on its use at the time • The solution chosen is to build a unique, dynamic list of business rules for each user, based on the qualitative feedback obtained from that user
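A sketch of the exact-match problem behind Example III (the names and threshold are invented for illustration): an exact lookup misses an alternatively spelled company name, while a simple similarity measure from Python's standard library finds it.

```python
from difflib import SequenceMatcher

existing_names = ["Müller GmbH", "Mueller GmbH & Co. KG", "Schmidt AG"]

def find_fuzzy_duplicates(query, names, threshold=0.8):
    """Return existing names whose similarity to the query exceeds the threshold."""
    return [n for n in names
            if SequenceMatcher(None, query.lower(), n.lower()).ratio() >= threshold]

print("Mueller GmbH" in existing_names)   # False: exact match fails, so the
                                          # user simply creates a new record
print(find_fuzzy_duplicates("Mueller GmbH", existing_names))
# ['Müller GmbH']: the alternatively spelled record is found before a duplicate
# is created; the trade-off is the cost of scoring every candidate combination,
# which is what made filtering slow at RWTÜV scale.
```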
Hidden assumptions of the conventional (closed) model for PSI quality control • There is a strict split between the (public sector) generation of data and the (private sector) re-use of that information • The flow of data is unidirectional • The generator of the data is solely responsible for the quality of the data • Lack of quality of PSI is an important obstacle to re-use
[Diagram, Example VI: conventional (‘closed’) model for PSI quality control (cf. the Czech waste case): data flows in one direction from the primary process, through an ex ante quality criteria filter (accept/reject) in the public sector, to re-use and final use in the private sector. Example VI*: the same closed model with ex post quality control outsourced to the private sector (e.g. Acxiom).]
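The unidirectional flow can be written down as a one-way pipeline (a sketch under invented names; the cleanup step stands in for an Acxiom-style service): data only ever moves from generator to re-user, and nothing flows back.

```python
def ex_ante_filter(records):
    """Public sector: only records meeting the fixed criteria are released."""
    return [r for r in records if r.get("company_id")]

def ex_post_cleanup(records):
    """Private sector (Acxiom-style): fixes whatever got through the filter."""
    return [{**r, "company_id": r["company_id"].strip().upper()} for r in records]

def closed_model(primary_process_output):
    # Strictly one direction: generate -> ex ante filter -> re-use -> ex post
    # cleanup -> final use. The generator never hears about downstream fixes.
    return ex_post_cleanup(ex_ante_filter(primary_process_output))
```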
‘Enlightened’ models for PSI quality control • The generation, re-use and final use of data are intertwined • The flow of data is multidirectional • The public content holder does not have to be the generator, but is always at least partly responsible for the quality of the data • Lack of fitness for (re)use is an important obstacle to re-use (not lack of primary data quality per se)
[Diagrams: Example VII: intertwined, multidirectional data flows between primary process, re-use and final use, across the public and private sectors. Example VIII: co-management of quality (geo-information, Norway). Example IX: co-management of quality by the final user (Latvia). Example X: central role of government in quality management.]
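A minimal sketch of the multidirectional idea (the class and methods are invented for illustration, not a model from the presentation): downstream re-users and final users push corrections back to the public content holder, who reviews and applies them, so quality is co-managed rather than fixed ex ante.

```python
class PublicRegister:
    def __init__(self, records):
        self.records = records          # keyed by record id
        self.pending_corrections = []   # feedback from re-users / final users

    def publish(self):
        """Outward flow: re-users get the current state of the data."""
        return dict(self.records)

    def report_correction(self, record_id, field, value, reporter):
        """Return flow: anyone downstream can flag an error."""
        self.pending_corrections.append((record_id, field, value, reporter))

    def review_corrections(self):
        """The holder stays (partly) responsible: feedback is reviewed, then applied."""
        for record_id, field, value, reporter in self.pending_corrections:
            self.records[record_id][field] = value   # accept after review
        self.pending_corrections.clear()

register = PublicRegister({"NL-123": {"address": "Oldd Str 1"}})
snapshot = register.publish()                                             # public -> private
register.report_correction("NL-123", "address", "Old Str 1", "re-user")  # private -> public
register.review_corrections()
```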
Quality and pricing • Many public content holders fear that opening up their information (for free) to the public at large cannibalizes their income from commercial re-use • In general, though, low-end and high-end markets can very well co-exist; the price discrimination is based on differences in quality • In the specific case of information goods, this quality does not always refer to the quality of the primary data itself, but especially to the ‘fitness for re-use’
[Diagrams: Example XI: commercial re-use excludes free use for the public at large. Example XII: a high-end market (‘fit for re-use’: high quality, first price) coexists with a low-end market (low quality, no price).]
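A sketch of how such price discrimination could look for a dataset (the tiers, fields and degradation rule are invented for illustration): the same primary data is offered both as a priced, ‘fit for re-use’ version and as a free, deliberately coarser version for the public at large.

```python
# One primary dataset, two quality tiers. The free tier degrades fitness for
# re-use (precision, freshness), not the primary data itself.
primary_data = [
    {"company_id": "NL-123", "lat": 52.37403, "lon": 4.88969, "updated": "2007-04-19"},
]

def high_end_version(records):
    """'Fit for re-use': full precision and freshness, sold at a price."""
    return records

def low_end_version(records):
    """Free public version: coarser coordinates, no update timestamp."""
    return [{"company_id": r["company_id"],
             "lat": round(r["lat"], 1), "lon": round(r["lon"], 1)}
            for r in records]

offers = {
    "commercial re-use": (high_end_version(primary_data), "priced"),
    "public at large":   (low_end_version(primary_data),  "free"),
}
```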