United Nations Regional Workshop on Census Data Processing. Contemporary Technology for Census Data Capturing and Editing: A Perspective of the South African Data Processing System. A presentation by the South African Data Processing Team. Dar-es-Salaam, Tanzania, 9-13 June 2008
The presentation layout: Introduction; Goal of data processing; Planning phase; Design of the data processing system; System development & testing; Implementation & operations; Process flow; Document management system; Progress reporting; Tool of scanning; Exceptions; Quality assurance (QA); Accounting or balancing process; Data validation & editing; Tabulation and output products
Data processing is considered part of the survey operations value chain (properly defined accountability structure); there are defined inter-dependency links with other census sections (i.e. questionnaire design, data collection, …); it is heavily dependent on the information technology support available around the country (outsourcing of the management of the system); tight project management principles check timelines, resources and detailed production lines; it is obliged to adapt to ever-changing technology (1996 KFP, 2001 Census scanning, 2007 scanning with old scanners, 2011 Census scanning with upgraded scanners). Introduction
To accurately process or convert the statistical information from the different collection tools, such as the questionnaire, into comprehensive electronic data that are clean, accurate, consistent and reliable. Goal of Data Processing
Going through the lessons learned from previous censuses and surveys (1996 Census, 2001 Census, 2007 Community Survey):
Preparation of the processing site: in the 1996 Census, distributed data processing centres in 9 provinces; in the 2001 Census and 2007 CS, a centralised data processing centre.
Mode of data capturing: in the 1996 Census, manual capturing (key from paper) running on a SQL database with an interface developed in Visual Basic; in the 2001 Census and 2007 CS, use of proprietary scanning technology linked to an Oracle database.
Census budget: the 1996 Census budget was estimated at 500 million Rand, the 2001 Census at 1.2 billion Rand and the 2007 CS at 600 million Rand.
Human resources: the 1996 Census had more staff for key from paper (an option also considered for job creation across the country); the 2001 Census and 2007 CS had a reduced number of staff supporting the scanning technology, working in shifts.
Duration: the 1996 Census data capturing was planned for 12 months; the 2001 Census was planned for 6 months, but the period was extended to 18 months because the new technology had not been tested; the 2007 CS took only 3 months, as planned.
Systems design and specifications: in the 2001 Census, system specification and development were still being reviewed during implementation; in the 2007 CS, most of the system specification and development was completed and tested before production.
Planning phase
Strategic plan: There is a policy on standard procedures in terms of documentation, process flow, metadata and concepts, managed by the DMID (Data Management and Information Delivery) project; a common strategy across the survey programme of using scanning technology with control of transactions in a database; a move toward a centralised corporate data processing centre (store management, …); accounting of production transactions by tracking the questionnaire using a barcode; measurement of quality at each process of the production; a permanent team of data processors in order to keep the experience while building capacity; acceptance of any system or module into production only after it has gone through the testing phase, to avoid the 2001 Census experience of an untested system. Planning phase
Operational plan & budget: Since the 2001 Census, there is a detailed list of activities, sub-activities and tasks with timelines (start and end dates) and responsible persons; since the 2007 CS, each activity is linked to a budget in what is called activity/task-based costing; since the 2007 CS, there is an independent and dedicated team in charge of project management and monitoring of activities; a list of documents and other deliverables is submitted to the project management team (PMO) to keep track of progress; performance indicators are developed for the PMO to track, giving the daily production counts per process; based on activity costing, the budget has never been an issue, except in the 2001 Census when the project went beyond the planned period. Planning phase
The data processing team gets the user requirements from the questionnaire design team and the data collection team; the team, comprising data processors, a system analyst (1 person), programmers, statisticians and data technologists (IT technicians), prepares the overall design specifications; the data processing team is supplemented by the data collection team in the management of production and of staff on the flow; the scanning module of the system is outsourced (in 2001 to a consortium of companies, but in the 2007 CS one company was accountable); in the 2007 CS, data processing project management was controlled in-house to avoid the lack of accountability observed in the 2001 Census, where it was done externally (PROCON); since the workflow kept changing in the 2001 Census, an approved workflow with the operational procedure manual was ready in the 2007 CS before the start of production; the functional specifications were done only in the 2007 CS as part of the overall system specification; the technical specifications were completed as an as-built system in the 2001 Census, whereas the 2007 CS specifications were done before any implementation. Design of Data Processing
In the 1996 Census, system development was done by the in-house team supported by Swedish consultants; in the 2001 Census, system development was outsourced to a locally based company that put together a consortium of service providers for project management, system development, scanner specialists/maintenance, and image and recognition software; in the 2007 CS, system development and project management were done in-house, outsourcing only the scanning software and scanner maintenance; in the 1996 Census, only unit tests were conducted, whereas in the 2001 Census most of the tests (unit tests, production load tests, …) were conducted while already in production; in the 2007 CS, all tests were done before production: for instance, the background colour drop-out was tested in the 2007 CS, whereas the blue background colour in the 2001 Census required a blue light in the scanner (tested after months of production); decisions on exception handling were made during production in the 2001 Census (rescan or transcription), whereas in the 2007 CS exceptional questionnaires were sent to Key From Paper (KFP) or Key From Image (KFI); in the 2007 CS, false-positive readings were reduced by introducing voting rules between two different recognition engines (see the sketch below), whereas in the 2001 Census all false-positive readings were sent to the verification stage (tiling and completion/key correction). System Development & Testing
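As a rough illustration of the voting idea, the sketch below accepts a field value only when two recognition engines agree with sufficient confidence, and otherwise routes it to manual verification (Key From Image). The Reading structure, the 0.9 confidence cut-off and the function names are assumptions for illustration, not the actual 2007 CS recognition software.

```python
# Illustrative voting rule between two recognition engines: accept a field
# only when both engines agree and are confident; otherwise send it to
# Key From Image / key correction. All names and thresholds are assumed.
from dataclasses import dataclass

@dataclass
class Reading:
    value: str         # recognised text for one questionnaire field
    confidence: float  # engine confidence between 0.0 and 1.0

def vote(engine_a: Reading, engine_b: Reading, min_confidence: float = 0.9):
    """Return (value, needs_verification) for one field."""
    agree = engine_a.value == engine_b.value
    confident = min(engine_a.confidence, engine_b.confidence) >= min_confidence
    if agree and confident:
        return engine_a.value, False   # accept automatically
    return engine_a.value, True        # route to manual verification

if __name__ == "__main__":
    # Example: the engines disagree on an age field, so it goes to verification.
    value, needs_kfi = vote(Reading("34", 0.97), Reading("84", 0.95))
    print(value, "-> verify" if needs_kfi else "-> accepted")
```

The design point is simply that two independent engines rarely make the same misreading, so requiring agreement trades a little extra manual work for far fewer false positives.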
Operational procedures: In the 2001 Census, the operational procedure manual was prepared during production; in the 2007 CS, the operational procedures were in place before training; a daily production account is produced (extracted from the Oracle database).
Recruitment: In the 1996 Census, production staff were selected based on keying speed only; in the 2001 Census, production staff were recruited based on each process requirement; in the 2007 CS, production staff have versatile skills as data processors and can move between processes depending on needs, as determined by the flow manager. In the 2001 Census, staff worked 24 hours a day, 7 days a week, in 3 shifts; in the 2007 CS, only one shift was operated to meet the deadline.
Training: In the 2001 Census, training was conducted by the service provider (PROCON), whereas in the 1996 Census and 2007 CS the training was given by the senior data processors, system developers and statisticians who were part of the design team.
Preparation of the work environment: The 1996 Census used 9 sites; the 2001 Census used one warehouse site; and in the 2007 CS there were two sites (one for main storage and the other for production). Site preparation, including partitioning, hardware and networking, was completed one month before the end of the census field operation.
Implementation & Operations
High-Level Process Flow. Operations cont…
Tracking document movement across processes; accounting for all transactions, including production staff logins; database driven (Sybase in 1996, Oracle in 2001 and 2007); progress reporting per user, per function and per process; the reporting supports performance management (speed, time, production units, …). Operations cont… Document Management System
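To make the transaction-tracking idea concrete, here is a minimal sketch: every questionnaire movement is logged against its barcode, process and operator, and the progress report per user and per process is a simple aggregation over that log. The real systems ran on Sybase/Oracle with their own schemas; the table layout, barcode values and process names below are hypothetical.

```python
# Minimal sketch of a document management transaction log and a production
# count report. Schema, identifiers and process names are assumed.
import sqlite3
from datetime import datetime

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE transactions (
        barcode   TEXT,   -- unique questionnaire identifier
        process   TEXT,   -- e.g. scanning, KFI, KFP, QA
        operator  TEXT,   -- production staff login
        logged_at TEXT
    )
""")

def log_movement(barcode: str, process: str, operator: str) -> None:
    """Record one movement of a questionnaire through a process."""
    conn.execute(
        "INSERT INTO transactions VALUES (?, ?, ?, ?)",
        (barcode, process, operator, datetime.now().isoformat()),
    )

# Record a few movements, then report production counts per process and user.
log_movement("Q000123", "scanning", "op_01")
log_movement("Q000123", "KFI", "op_02")
log_movement("Q000124", "scanning", "op_01")

for process, operator, count in conn.execute(
    "SELECT process, operator, COUNT(*) FROM transactions "
    "GROUP BY process, operator ORDER BY process"
):
    print(process, operator, count)
```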
Operations cont… Progress reporting
Kodak 9520D scanner: used in the 2001 Census and in the 2007 CS; differential scanner feeding (page by page and/or in batches); barcode recognition at scanning time. Operation cont… Tool of scanning
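A hedged sketch of what barcode recognition at scanning time feeds into: validating the captured barcode against an expected identifier layout, so that unreadable or malformed identifiers can be raised as exceptions (next slide). The "EAxxxxx-DUxx-HHxx" pattern is purely illustrative, not the actual Stats SA barcode format.

```python
# Validate a scanned barcode against an assumed identifier layout
# (Enumeration Area, Dwelling Unit, Household). Pattern is illustrative only.
import re
from typing import Optional

BARCODE_PATTERN = re.compile(r"^EA(\d{5})-DU(\d{2})-HH(\d{2})$")  # assumed layout

def check_barcode(barcode: str) -> Optional[dict]:
    """Return the identifier components, or None if the barcode is unusable."""
    match = BARCODE_PATTERN.match(barcode)
    if match is None:
        return None  # would be raised as an exception during production
    ea, du, hh = match.groups()
    return {"enumeration_area": ea, "dwelling_unit": du, "household": hh}

print(check_barcode("EA10234-DU03-HH01"))  # parsed components
print(check_barcode("smudged-code"))       # None -> exception handling
```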
Questionnaire transcription: damaged; unscannable; inconsistent page numbering; unique identifier (barcode) problems.
Key From Paper (KFP): poor image quality; faint writing; missing pages; wrong unique identifier (Enumeration Area, Dwelling Unit & Household Number).
False-positive reading: poor software recognition; poor image quality; incomplete text (characters); unrecognised mark or character.
Failed quality checks: quality rate below the threshold (95% accuracy rate).
Operation cont… Exceptions
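The sketch below illustrates how capture exceptions of the kinds listed above could be routed to the appropriate manual process. The category sets mirror the slide, but the routing function itself is an assumption, not the actual 2007 CS workflow code.

```python
# Illustrative routing of capture exceptions to a manual process.
TRANSCRIPTION = {"damaged", "unscannable", "inconsistent page numbering", "barcode problem"}
KEY_FROM_PAPER = {"poor image quality", "faint writing", "missing pages", "wrong unique identifier"}
VERIFICATION = {"poor recognition", "incomplete character", "unrecognised mark"}

def route_exception(reason: str) -> str:
    """Decide which manual process an exceptional questionnaire goes to."""
    reason = reason.lower()
    if reason in TRANSCRIPTION:
        return "transcription"
    if reason in KEY_FROM_PAPER:
        return "key_from_paper"
    if reason in VERIFICATION:
        return "key_from_image_verification"
    return "qa_review"  # e.g. a batch whose accuracy rate fell below 95%

for reason in ("damaged", "faint writing", "unrecognised mark", "quality below threshold"):
    print(reason, "->", route_exception(reason))
```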
In the 1996 Census, quality was implemented as part of double keying without any measurement attached to it; in the 2001 Census, quality was measured at scanning time (image quality checks) and after data capturing (key from image of sampled batches, with a threshold of 97%); in the 2007 CS, a sample of captured questionnaires was subjected to a second capture and compared with the first capture to determine an agreement rate (the threshold was 95%, reduced because of the good image quality): for scanned cases, a sample was keyed from image and an agreement rate calculated; for exceptional cases, 100% were double keyed from paper and an agreement rate calculated. Operation cont… Quality Assurance (QA)
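As an illustration of the agreement-rate measure, the sketch below compares a first and a second capture field by field and checks the result against the threshold. The record layout and field names are assumptions, not the actual QA system.

```python
# Sketch of the QA agreement-rate idea: a sample is captured twice and the
# share of matching fields is compared with the threshold (95% in 2007 CS).
def agreement_rate(first_capture: list, second_capture: list) -> float:
    """Share of fields on which the two captures agree."""
    agreements = total = 0
    for rec1, rec2 in zip(first_capture, second_capture):
        for field in rec1:
            total += 1
            agreements += rec1[field] == rec2.get(field)
    return agreements / total if total else 1.0

first = [{"age": "34", "sex": "F", "relationship": "head"}]
second = [{"age": "34", "sex": "F", "relationship": "spouse"}]

rate = agreement_rate(first, second)
print(f"agreement rate: {rate:.0%}",
      "PASS" if rate >= 0.95 else "FAIL -> QA follow-up")
```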
After capturing, each questionnaire is accounted for, linked to its geographical area (EA) and checked for the correct data structure (household, persons, …) before any export; in the 1996 Census, the captured data were exported into SAS/ASCII for post-capture processing (editing and tabulation); in the 2001 Census, the balancing process took longer because of the lack of a reference link to the EA for postal questionnaires (self-enumeration); in the 2007 CS, a Census and Administration System (CSAS) assisted in getting a full account of the questionnaires linked to their referenced geography. Accounting or Balancing process
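A minimal sketch of the balancing check, assuming an illustrative record layout: before export, each captured questionnaire must link to a known EA and carry a consistent household/person structure.

```python
# Balancing/accounting sketch: flag questionnaires that do not link to a
# known EA or that have an incomplete structure. Record layout is assumed.
def balance_check(questionnaires: list, ea_frame: set) -> list:
    """Return a list of problems found; an empty list means the batch balances."""
    problems = []
    for q in questionnaires:
        if q.get("ea_code") not in ea_frame:
            problems.append(f"{q['barcode']}: not linked to a known EA")
        if not q.get("household") or not q.get("persons"):
            problems.append(f"{q['barcode']}: incomplete data structure")
    return problems

ea_frame = {"EA10234", "EA10235"}
batch = [
    {"barcode": "B001", "ea_code": "EA10234",
     "household": {"dwelling_type": "house"}, "persons": [{"age": 34}]},
    {"barcode": "B002", "ea_code": "EA99999",
     "household": {"dwelling_type": "flat"}, "persons": []},
]
for problem in balance_check(batch, ea_frame):
    print(problem)
```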
In 1996, the adopted strategy was not to impute any derived values; only manual editing was allowed. In the 2001 Census, based on editing specifications developed with the assistance of the US Bureau of the Census, automated editing was implemented using IMPS/CSPro, and the 2007 CS followed the same approach; different editing reports with imputation rates were produced for an editing committee, which decided on the rules to apply for correction. In the 2001 Census and 2007 CS, limited manual editing was implemented. One of the key editing rules is the removal of minimally processable cases caused by poor recognition or false-positive reading. Though the editing has been done in ASCII, the output database is exported in different formats (i.e. user driven: ASCII, Oracle, SAS, …) linked with the metadata. Data validation & Editing
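The sketch below illustrates two of the editing ideas described above: dropping minimally processable cases and reporting imputation rates per field for an editing committee. The thresholds and field names are illustrative; the actual 2001/2007 editing rules were specified separately and implemented in IMPS/CSPro.

```python
# Editing sketch: (1) drop records with too little processable content,
# (2) report the share of records per field that would need imputation.
# Field list and the minimum of 2 core fields are assumptions.
CORE_FIELDS = ["age", "sex", "relationship", "population_group"]

def is_processable(record: dict, min_core_fields: int = 2) -> bool:
    """Keep a person record only if enough core fields were captured."""
    captured = sum(1 for f in CORE_FIELDS if record.get(f) not in (None, ""))
    return captured >= min_core_fields

def imputation_rates(records: list) -> dict:
    """Share of records per field with missing values (candidates for imputation)."""
    n = len(records) or 1
    return {f: sum(1 for r in records if r.get(f) in (None, "")) / n
            for f in CORE_FIELDS}

records = [
    {"age": "34", "sex": "F", "relationship": "head", "population_group": ""},
    {"age": "", "sex": "", "relationship": "", "population_group": ""},  # minimal case
]
records = [r for r in records if is_processable(r)]  # minimal case removed
print(imputation_rates(records))
```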
Since Stats SA policy is to give data users access, the strategy is to put the census data in different formats to increase accessibility and promote data use; in the 1996 Census, the output database was packaged in a SuperCross database and a set of aggregated databases was put on CD for users; in the 2001 Census, access to the data was increased by adding online tabulation tools (PX-Web), SuperCross, a reduced ASCII file, …; in the 2007 CS, the data are also available in different formats (SuperCross, ASCII file, PX-Web and other map/chart-linked tools); the traditional reports are still produced based on the tabulation plan/output reports. Tabulation and output products
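As a small illustration of turning the edited file into one table from a tabulation plan, the sketch below cross-tabulates two variables from a pretend fixed-format ASCII export; the column positions and codes are assumptions for illustration only.

```python
# Tabulation sketch: read a (pretend) fixed-format ASCII export and
# cross-tabulate age group by sex. Column layout is assumed.
from collections import Counter
import io

# Pretend export: columns 1-2 age-group code, column 3 sex code.
ascii_export = io.StringIO("012\n031\n032\n011\n")

table = Counter()
for line in ascii_export:
    age_group, sex = line[0:2], line[2]
    table[(age_group, sex)] += 1

for (age_group, sex), count in sorted(table.items()):
    print(f"age group {age_group}, sex {sex}: {count}")
```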
Improve the quality of the data; save time; reduce costs. Benefits of scanning technology