220 likes | 348 Views
WOSQ2011. SNEED. A Process for Assessing Data Quality Harry M. Sneed ANECON GmbH, Wien University of Regensburg, Bayern OÖ-Fachhochschule Hagenberg, Austria. ACM Workshop on Software Quality Szeged, Hungary 4th September 201 1. WOSQ2011. SNEED-1. The Problem of Data Quality.
E N D
WOSQ2011 SNEED A Process for AssessingData QualityHarry M. Sneed ANECON GmbH, WienUniversity of Regensburg, BayernOÖ-Fachhochschule Hagenberg, Austria ACM Workshop on Software Quality Szeged, Hungary 4th September 2011
WOSQ2011 SNEED-1 The Problem of Data Quality • The original data model was never thought through and does not satisfy the requirements on the data. • The original database design was once adequate, but over the years it has become obsolete as data requirements changed. • The database structure has depreciated under the weight of many minor alterations (new attributes, keys and Indexes) • The database is now used for other purposes than for which it was originally intended. • As a result of program faults more and more errors have crept unnoticed into the data and corrupted it. • An ever increasing number of dead data objects has accumulated over time and burdens the database. • Data redundancy leads to inconsistent information. • As a result of continual evolution, the data structure has become so complex that it can no longer be evolved.
WOSQ2011 SNEED-2 The Purpose of a Data Audit • A Data Audit is a process by which the quality of the database is assessed and suggestions for improvement are made. • The Process encompasses two parallel activities: • Static Analysis of the Database Schema • Static Analysis of the Database Content • The static analysis of the database schema entails an inspection and measurement of the data structures and data type definitions. • The static analysis of the database content entails a validation of the actual data content against a specification of what that content should be. • The goal is to uncover discrepancies and deficiencies in the data and to propose means of eliminating or improving them.
WOSQ2011 SNEED-3 Static Analysis of the Database Schema • The database schema is compared against the rules for data structuring, including the rules of data normalization. • The data type definitions are checked for compliance and incompliant types reported. • Structural deficiencies in the database schema are searched for and uncovered. • Size and complexity limits are controlled and excesses noted. • The size, complexity and quality of the database structure is measured to compare with other databases and with a benchmark. • Recommendations for the improvement of the database structure are suggested.
WOSQT2011 Database Analyst SNEED-4 Setting the rules Defining themetrics Rules Metrics Database Schema Database Views Database Schema Auditor Deficiency Report Metric Report Figure 1: Static Database Analysis
WOSQ2011 SNEED-5 Checking for Database Deficiencies • Typical relational data deficiencies: • There is no unique primary key • Foreign key to related table is missing • There are no views defined upon a table • No indexes are defined for a table • Key contains too many sub keys • Table contains more than allowed number of columns • Table has more than allowed number of indexes • Table has more than allowed number of views • Table row exceeds maximum allowed length • Table contains incompatible data types • Null Option is missing • Delete Option is missing • Data names do not conform to naming convention • Data schema is inadequately commented.
WOSQ2011 SNEED-6 Sample Database Deficiency Report | | | | WARN: | 22 TimeStamp Data is not allowed | | 253| LEDG_DAT_INS DATETIME YEAR TO SECOND DEFA | | | | | WARN: | 18 Variable Character Data is not allowed | | 254| LEDG_USR_INS VARCHAR (20) DEFAULT USER, | | | | | WARN: | 17 Not null Option is missing | | 254| LEDG_USR_INS VARCHAR (20) DEFAULT USER, | | | | | WARN: | 17 Not null Option is missing | | 256| LEDG_USR_UPD VARCHAR (20), | | | | | PROB: | 05 Table has no View definition | | 258| EPSLEDGE | | | | | DEFI: | 08 Table has too many relationships (foreign keys) | | 258| EPSLEDGE | | | | | DEFI: | 11 Row Length exceeds maximum length allowed | | 258| EPSLEDGE | | | | | DEFI: | 12 Table has too few comments | | 258| EPSLEDGE | | | | +----------------------------------------------------------------------+ | Number of major Rule Violations = 11 | | Number of media Rule Violations = 11 | | Number of minor Rule Violations = 132 | | Total Number of Rule Violations = 154 | | Number of Database Schema Lines = 267 | | | | Rate of Rule Conformity = 0.651 | +----------------------------------------------------------------------+
WOSQ2011 SNEED-7 Database Quality Metrics • data independence = number of views and indexes relative to the number of tables or record types • data accessibility = number of keys, indexes and selects relative to the number of tables, • data flexibility = number of attributes and keys relative to the number of tables, • data testability = number of test cases relative to the number of data attributes stored, • data storage efficiency = number of attributes stored relative to the size of the storage in bytes, • data conformity = number of schema deficiencies relative to the number of schema lines.
WOSQ2011 SNEED-8 Database Complexity Metrics • data complexity = number of different data types and keys , relative to the sum of all data attributes stored, • view complexity = number of views relative to the number of tables, • access complexity = number of tables, keys and indexes relative to the number of stored attributes, • relational complexity = number of foreign keys and relations relative to the number of tables and keys in all, • structural complexity = number of structural elements - tables, relations and views – relative to the number of stored attributes, • storage complexity = number of attributes stored relative to the storage capacity.
WOSQ2011 SNEED-9 Size Metrics of individual Data Tables +------------------------------------------------------------------------------+ | | | SQL DATA BASE METRIC REPORT | |SYSTEM: DATABASE | | DATE: 10.03.11 PAGE: 001 | +------------------------------------------------------------------------------+ | NR NR DATA PRIM FOR NR NR ROW DATA FUNC NR | | TABLE NAME LINES FLDS TYPES KEYS KEYS RELS INDX SIZE PTS PTS DEFIS| +------------------------------------------------------------------------------+ |[WEEKDAYS] 012 03 02 01 00 01 00 104 009 007 002 | |[ERRORLOGS] 016 05 02 01 00 01 00 017 011 007 002 | |[AUSWERTUNGTYP] 016 06 03 01 00 01 00 163 012 007 002 | |[FEATUREDMITGLIEDPOO 014 05 04 01 00 01 00 774 011 007 003 | |[FACHGRUPPEN] 028 10 06 01 00 01 00 101 016 007 002 | |[KONTAKTARTEN] 014 04 02 01 00 01 00 053 010 007 002 | |[CATEGORIES] 022 06 02 01 00 01 00 021 012 007 002 | |[AUSWERTUNGSPALTE] 012 03 02 01 00 01 00 101 009 007 002 | |[MENUCATEGORIES] 012 03 02 01 00 01 00 058 009 007 002 | |[SITEMAP] 024 12 04 01 00 01 00 1309 018 007 003 | |#FILTERCRITERIA 010 02 02 00 00 01 00 008 006 007 003 | |[BUSINESSHOURS] 014 05 03 01 00 01 00 017 011 007 002 | |[ZERTIFIKATE] 034 10 05 01 00 01 00 755 016 007 003 | |#FILTERCRITERIA 010 02 02 00 00 01 00 008 006 007 003 | |[REPORTINGHISTORY] 018 09 04 01 00 01 00 538 015 007 003 | |#FILTERCRITERIA 010 02 02 00 00 01 00 008 006 007 003 | |[MITARBEITERFUNKTION 013 04 04 01 00 01 00 058 010 007 002 | |#FILTERCRITERIA 010 02 02 00 00 01 00 008 006 007 003 | |[SCHLAGWORTE] 019 03 03 01 00 01 00 257 009 007 002 | |[PRODUKTE] 017 07 03 01 00 01 00 1205 013 007 003 | |[REGIONEN] 022 12 06 01 00 01 00 142 018 007 002 | |[REPORTTYPE] 012 03 03 01 00 01 00 058 009 007 002 | |[SUCHNAMEPATTERNS] 007 02 02 00 00 01 00 017 006 007 003 | |[STATISTIKOHNEPRODUK 018 09 04 01 00 01 00 523 015 007 003 | |[BANNER] 014 05 02 01 00 01 00 773 011 007 003 | |[STATISTIKOHNEEDITIE 018 09 04 01 00 01 00 523 015 007 003 |
WOSQ2011 SNEED-10 Summary Report of Database Size Metrics +------------------------------------------------------------------------------+ | D A T A B A S E Q U A N T I T Y M E T R I C S | +------------------------------------------------------------------------------+ | Total Number of SQL Source Files ===> 01 | | Total Number of SQL Code Lines ===> 20107 | | Total Number of SQL Comment Lines ===> 2320 | | Total Number of SQL Procedure Statements ===> 7842 | | Total Number of SQL Statement Types ===> 3245 | | Total Number of SQL Access Operations ===> 3732 | | Total Number of SQL Select Operations ===> 1323 | | Total Number of SQL Insert Operations ===> 334 | | Total Number of SQL Update Operations ===> 73 | | Total Number of SQL Delete Operations ===> 43 | | Total Number of SQL Stored Procedures ===> 1497 | | Total Number of SQL Procedure Blocks ===> 558 | | Total Number of SQL Procedure Conditions ===> 1933 | | Total Number of SQL Procedure Loops ===> 09 | | Total Number of SQL Procedural Branches ===> 2439 | | Total Number of SQL Data Declarations ===> 266 | | Total Number of SQL Set Data Assignments ===> 1325 | | Total Number of SQL Output Operations ===> 1074 | | Total Number of SQL Input Operations ===> 749 | | Total Number of Stored Procedure Points ===> 4806 | | Total Number of SQL Tables ===> 62 | | Total Number of Data Attributes ===> 468 | | Total Number of different Data Types ===> 202 | | Total Number of primary Keys ===> 56 | | Total Number of Foreign Keys ===> 00 | | Total Number of DB-Indexes ===> 17 | | Total Number of Relationships ===> 62 | | Total Number of Database Views ===> 15 | | Total Number of Database View Attributes ===> 119 | | Total Length of all SQL Rows ===> 36028 | | Total Number of Data-Points ===> 10884 | | Total Number of Function-Points ===> 434 | | Total Number of Deficiencies ===> 311 | | Total Number of required Test Cases ===> 1377 | +------------------------------------------------------------------------------+
WOSQ2011 SNEED-11 Sample Complexity and Quality Metrics +------------------------------------------------------------------------------+ | D A T A B A S E C O M P L E X I T Y M E T R I C S | +------------------------------------------------------------------------------+ | Data Complexity ===> 0.611 | | Access Complexity ===> 0.635 | | Relational Complexity ===> 0.609 | | Structural Complexity ===> 0.228 | | Storage Complexity ===> 0.120 | | | | Average Database Complexity ===> 0.440 | | | +------------------------------------------------------------------------------+ | D A T A B A S E Q U A L I T Y M E T R I C S | +------------------------------------------------------------------------------+ | Data Independence ===> 0.471 | | Data Accessability ===> 0.859 | | Data Testability ===> 0.518 | | Data Conformity ===> 0.984 | | Storage Efficiency ===> 0.200 | | | | Average Database Quality ===> 0.606 | | | +------------------------------------------------------------------------------+
WOSQ2011 SNEED-12 Static Analysis of the Database Content • The database contents are checked against assertions on the valid value domains as well as against assertions on valid relationships. • Invalid data values are discovered thru comparison with valid values and recorded. • Missing data objects are reported. • Redundant data objects and data corpses are revealed and reported. • The degree of correctness is computed for each data table to compare with other tables. • The correctness metric of the database content is passed on to the metric database for further evaluation.
Data Validation Process WOSQ2011 SNEED-13 Database Schema Database Analyst Validation script Specify the scripts Validation Script Compiler Comparison Tables Data Base Content Database Tables Database Content Auditor Validation Report Figure 2: Data Validation Process
WOSQ2011 SNEED-14 Sample Assertion Script table: LAENDER; // This is test procedure is for checking the correctness of the Bundesländer when (new.BUNDESLAND_ID = "1") then replace new.SUMME by "11"; when (new.BUNDESLAND_ID = "2") then replace new.SUMME by "12"; when (new.BUNDESLAND_ID = "3") then replace new.SUMME by "13"; when (new.BUNDESLAND_ID = "4") then replace new.SUMME by "14"; when (new.BUNDESLAND_ID = "5") then replace new.SUMME by "15"; when (new.BUNDESLAND_ID = "6") then replace new.SUMME by "16"; when (new.BUNDESLAND_ID = "7") then replace new.SUMME by "17"; when (new.BUNDESLAND_ID = "8") then replace new.SUMME by "18"; when (new.BUNDESLAND_ID = "9") then replace new.SUMME by "19"; when (old.STAATSID = "14") then replace old.STAATSID by "13"; when (new.SUMME = "65") then replace new.SUMME by "66"; when (old.BUNDESLANDNR = "9") then skip; if ( new.BUNDESLAND_ID = old.BUNDESLANDNR ! "0" ); assertnew.STAAT_ID = old.STAATSID; assertnew.BUNDESLAND = "NIEDEROESTERREICH" if (old.SUMME = "33";) assertnew.BUNDESLAND = "OBEROESTERREICH" if (old.SUMME = "44"); assertnew.BUNDESLAND = old.BUNDESLANDif (other); assertnew.BUNDESLAND = old.BUNDESLAND; assertnew.SUMME = "100" ! "200" ! "300" ! "400"; assertnew.SUMME = {1.00:5.00}; assertnew.MITGLIEDER = "300" ! "656" ! "644" ! "92"; end;
WOSQ2011 Sample Data Validation Report SNEED-15 • | Key Fields of Record(new,old) • +-----------------------------------------------------------------------------------------+ • | Ist :BUNDESLANDNR • | Soll:BUNDESLANDNR • +----------------------------------------------+-----------------------------------------+ • | Non-Matching Fields | Non-Matching Values • +----------------------------------------------+-----------------------------------------+ • | RecKey:1 | • | Ist : BUNDESLAND | NIEDERöSTERREICH • | Soll: BUNDESLAND | NIEDEROESTERREICH • +----------------------------------------------+-----------------------------------------+ • | RecKey:2 | • | Ist : STAATSID | 13 • | Soll: STAATSID | 12 • +----------------------------------------------+-----------------------------------------+ • | RecKey:2 | • | Ist : BUNDESLAND | OBERöSTERREICH • | Soll: BUNDESLAND | OBEROESTERREICH • +----------------------------------------------+-----------------------------------------+ • | RecKey:4 | • | Ist : STAATSID | 13 • | Soll: STAATSID | 14 • +----------------------------------------------+-----------------------------------------+ • | RecKey:5 | • | Ist : BUNDESLAND | KäRNTEN • | Soll: BUNDESLAND | KAERNTEN • +----------------------------------------------+-----------------------------------------+ • | RecKey:8 | • | Ist : STAATSID | 13 • | Soll: STAATSID | 18 • +----------------------------------------------+-----------------------------------------+
WOSQ2011 Data Validation Metric Report SNEED-16 +----------------------------------------------------------------------------------------+ | New:MGNR SUBNR JAHR ART LFNR | | Old:MGNR SUBNR JAHR ART LFNR | +----------------------------------------------+-----------------------------------------+ | Non-Matching Fields | Non-Matching Values | +----------------------------------------------+-----------------------------------------+ | RecKey:000083 0000 0000 BAN 00002 | | | Ist: DATUM | 2002-04-29 | | Soll: DATUM | 2003-04-10 | +----------------------------------------------+-----------------------------------------+ | RecKey:000095 0000 0000 BAN 00002 | missing from the old File/Table | +----------------------------------------------+-----------------------------------------+ | RecKey:000112 0000 0000 BAN 00002 | | | Ist: DATUM | 2002-05-02 | | Soll: DATUM | 2003-04-29 | +----------------------------------------------+-----------------------------------------+ | RecKey:000083 0000 0000 BAN 00001 | missing from the new File/Table | +----------------------------------------------+-----------------------------------------+ | RecKey:000095 0000 0000 BAN 00001 | missing from the new File/Table | +----------------------------------------------+-----------------------------------------+ | RecKey:000112 0000 0000 BAN 00001 | missing from the new File/Table | +----------------------------------------------+-----------------------------------------+ +----------------------------------------------+-----------------------------------------+ | Total Number of old Records checked: 76 | | Number of old Records found in new File: 73 | | Number of old Records with duplicate Keys: 00 | | Number of old Records not in new Table: 03 | | Total Number of new Records checked: 76 | | Number of new Records found in old File: 75 | | Number of new Records with alternate Keys: 00 | | Number of new Records not in old File: 01 | | Total Number of Fields checked: 225 | | Total Number of non-Matching Fields: 02 | | Percentage of matching Fields: 99 % | | Percentage of matching Records: 99 % | +----------------------------------------------------------------------------------------+
WOSQ2011 Database Size Comparison SNEED-17
WOSQ2011 Database Complexity Evaluation SNEED-18
WOSQ2011 Database Quality Evaluation SNEED-19
0,5 0,6 0,4 1 0 Complexity WOSQ2011 Database Dashboard SNEED-20 0,003 0,5 0,4 0,6 0,002 0,001 1 0 0 0,020 Quality Defect Density Defects per 1000 Data 0,1 0,8 0,7 0,7 0,05 0,6 0,9 0,6 0,8 1 0 0,5 1,0 0,5 1,00 Change rate Correctness Conformity Visualization of the Database Status with SoftEval
WOSQ2011 SNEED-21 Conclusions • Databases are an important asset of corporate users, perhaps even more important than the software! • As with software the quality of data decreases over time as a result of usage and change. • The complexity of data increases over time as a result of change and extension. • Therefore, it is imperative to periodically clean up and restructure the data. • For that purpose one needs information on the state of the data. • Such information is provided by a database audit.