Data Quality Case Study Prepared by ORC Macro
Data Correction • Background • Data Correction • Tracking system • SAS AF query application • Guidelines • Profile Analysis • SSNs • Names
Profile Analysis—SSNs • Shared SSNs (n=7,100) • Candidates for correction (different names): 27% • Candidates for collapse (same or similar names): 73%
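The split between correction and collapse candidates can be illustrated with a minimal sketch. It assumes a flat list of (person_id, SSN, name) records and uses Python's `difflib.SequenceMatcher` as a stand-in for whatever name comparison was actually used; the sample records, the `classify_shared_ssns` helper, and the 0.85 threshold are all hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical sample records: (person_id, ssn, name). Not real data.
RECORDS = [
    (1, "111-11-1111", "Smith, John A"),
    (2, "111-11-1111", "Smith, Jon A"),
    (3, "222-22-2222", "Lee, Ann"),
    (4, "222-22-2222", "Garcia, Maria"),
    (5, "333-33-3333", "Chen, Wei"),
]

SIMILARITY_THRESHOLD = 0.85  # assumed cutoff; would need tuning on real data


def similar(a: str, b: str) -> bool:
    """Return True when two names are the same or nearly the same."""
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return ratio >= SIMILARITY_THRESHOLD


def classify_shared_ssns(records):
    """Group records by SSN and label every SSN held by 2+ person records."""
    by_ssn = defaultdict(list)
    for person_id, ssn, name in records:
        by_ssn[ssn].append((person_id, name))

    result = {}
    for ssn, people in by_ssn.items():
        if len(people) < 2:
            continue  # SSN is not shared, nothing to review
        names = [name for _, name in people]
        # Same or similar names suggest one person entered twice (collapse);
        # different names suggest a wrong SSN on one record (correction).
        all_similar = all(similar(names[0], other) for other in names[1:])
        result[ssn] = "collapse" if all_similar else "correction"
    return result


shared = classify_shared_ssns(RECORDS)
for ssn, label in shared.items():
    print(ssn, label)
```

Comparing every name against the first one in the group keeps the sketch short; a pairwise comparison would be more robust for SSNs shared by three or more records.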
Profile Analysis—Names • Possible duplicates: 23% (n=79,300) • Unique persons: 77% (n=267,081)
OLTP—Commons Cases • Definition • Statistics • Status
Data Correction • Identifying the extent of the problem • Investigating based on type of error • Validating the investigation • Implementing the change • Tracking the identification, investigation, validation, and implementation
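One way to picture the tracking of these steps is a small record that moves through the stages in order. This is a sketch only: the `Stage` and `CorrectionCase` names, the person ID, and the notes are hypothetical, and it says nothing about how the internal tracking system was actually built.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    IDENTIFIED = 1
    INVESTIGATED = 2
    VALIDATED = 3
    IMPLEMENTED = 4


@dataclass
class CorrectionCase:
    person_id: int
    description: str
    stage: Stage = Stage.IDENTIFIED
    history: list = field(default_factory=list)

    def advance(self, note: str) -> None:
        """Move to the next stage and log it; stages cannot be skipped."""
        if self.stage is Stage.IMPLEMENTED:
            raise ValueError("case is already implemented")
        self.stage = Stage(self.stage.value + 1)
        self.history.append((self.stage.name, note))


# Hypothetical case walked through all stages.
demo = CorrectionCase(person_id=12345, description="conflicting middle initials")
demo.advance("records and scripts reviewed")
demo.advance("name, SSN, degrees, and grants confirmed")
demo.advance("correction applied")
print(demo.stage.name, demo.history)
```

Logging each transition with a note gives an audit trail, so a correction can be traced back to the evidence that justified it.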
Data Correction—An Example PERSON_ID=3070908—PPRF record • Identification of problem: two different middle initials found • Investigation of problem: TA module; scripts run • Validation of information: name, SSN, degree(s), grant(s); sources
Data Correction—An Example PERSON_ID=3070908—PPRF record • Implementation of correction: grants report submitted to NIH OD • Tracking of correction: internal tracking system • Post-correction: loss of control of data
Developing a Data Quality Business Plan
Focus of Our Activities • Examination of the database, procedures, and interface • Development of modified use cases • Unified Modeling Language • Identification and extraction of business rules • Identification of business model
Data Quality Issues • Type-over of information • Generation of duplicate persons • Collapsing • Changes in degree and address data • Generation of orphans
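The slide does not define "orphans". Assuming it means grant records that point at a person record which no longer exists (for example after a collapse), a check could look like the sketch below. The table layout, the IDs, and the `find_orphans` helper are hypothetical.

```python
# Hypothetical person table: person_id -> name. Not real data.
PERSONS = {101: "Smith, John A", 102: "Lee, Ann"}

# Hypothetical grant table: (grant_id, pi_person_id).
GRANTS = [("G-001", 101), ("G-002", 103), ("G-003", 102)]


def find_orphans(persons, grants):
    """Return grant IDs whose PI reference has no matching person record."""
    return [grant_id for grant_id, pi_id in grants if pi_id not in persons]


orphans = find_orphans(PERSONS, GRANTS)
print(orphans)
```

Running such a check after each collapse or type-over would surface broken linkages before they accumulate.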
Type-Over Practices • Intentions: • Assign a new principal investigator (PI) to a grant • Change the name of a PI on a grant • Correct a misspelled name • Consequences: • Inclusion of incorrect information in a person profile • Absence of linkages between PIs and grant applications • Creation of false linkages between PIs and grant applications
Factors Affecting Quality • Relatively easy access to person-related data elements • Lack of self-validation routines • Interface issues
Solutions • Restricted access • Quality control validation • Interface simplification • Self-validation algorithm
Data Quality Validation • Who does it? ICs; a Quality Assurance group; other • How is it done? Staging areas; manual and intelligent filtering; architecture
Self Validation • Name-matching algorithm • Consistency checking
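A self-validation routine of this kind might combine a name-matching check with simple consistency checks before an update is accepted. The following is a sketch under assumptions, not the algorithm the project developed: the record fields, the `validate_update` helper, the 0.8 threshold, and the NNN-NN-NNNN format rule are all hypothetical, and `difflib.SequenceMatcher` stands in for a real name-matching algorithm.

```python
import re
from difflib import SequenceMatcher

SSN_PATTERN = re.compile(r"\d{3}-\d{2}-\d{4}")  # assumed storage format
NAME_THRESHOLD = 0.8  # assumed cutoff; would need tuning on real data


def validate_update(old: dict, new: dict) -> list:
    """Return a list of problems found in a proposed person-record update."""
    problems = []

    # Name matching: a small edit looks like a spelling fix, while a very
    # different name looks like a type-over that replaces one person
    # with another.
    ratio = SequenceMatcher(None, old["name"].lower(), new["name"].lower()).ratio()
    if ratio < NAME_THRESHOLD:
        problems.append("name differs too much; possible type-over")

    # Consistency checks on the identifier.
    if not SSN_PATTERN.fullmatch(new["ssn"]):
        problems.append("SSN is not in NNN-NN-NNNN format")
    if old["ssn"] != new["ssn"]:
        problems.append("SSN changed; needs manual review")

    return problems


# Hypothetical records: a spelling fix passes, a type-over is flagged.
existing = {"name": "Smith, Jon", "ssn": "111-11-1111"}
spelling_fix = {"name": "Smith, John", "ssn": "111-11-1111"}
type_over = {"name": "Garcia, Maria", "ssn": "111-11-1111"}

print(validate_update(existing, spelling_fix))
print(validate_update(existing, type_over))
```

Returning a list of problems rather than raising on the first one lets the interface show the user everything that needs attention in a single pass.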
Higher-Level Analysis The following are being examined relative to their effect on quality: • Commons interface with IMPAC II • Database redundancy • Business rules in the database • Master person file • Front-end design • Human factors • Ownership
Development of a Data Quality Model
Major Goals Quality improvement plan for personal identifiers: • Evaluate the different identification algorithms currently in use for IMPAC II • Develop identification algorithm(s) and procedures • Serve as consultant and guarantor of the efficacy of algorithm implementation
Moving Forward • Understanding the technical infrastructure • Identification of specific areas of concern • Development/proposal of data quality expectations • Development/proposal of appropriate, acceptable solutions
Data Quality White Paper Knowledge assets are very real and carry tremendous value. Outline • Definition • Rules • Risks and Costs • NIH Expectations • Process • Measurements/Metrics • Testing • Continuous Improvements • Conclusions
Conclusion • Examination of the database, procedures, and interface • Development of modified use cases • Unified Modeling Language • Identification and extraction of business rules • Identification of business model • Development/proposal of appropriate, acceptable solutions • Development/proposal of data quality expectations • Identification of specific areas of concern • Understanding the technical infrastructure