
UNECA ACS ASSD

UNECA ACS ASSD. African Handbook on Census Data Processing, Analysis and Dissemination St. Georges Hotel, Pretoria 15 November 2009. Data Capture Methods. Traditional Key from Paper (KFP) Scanning model Key from Image (KFI) Optical Mark Recognition (OMR)



Presentation Transcript


  1. UNECA ACS ASSD African Handbook on Census Data Processing, Analysis and Dissemination St. Georges Hotel, Pretoria 15 November 2009

  2. Data Capture Methods • Traditional • Key from Paper (KFP) • Scanning model • Key from Image (KFI) • Optical Mark Recognition (OMR) • Optical Character Recognition (OCR) • Intelligent Character Recognition (ICR) • Intelligent recognition (IR) • Internet (IRS) • Handheld (PDA, Laptop, Net book etc.)

  3. Forms Type (Source) • Structured • Semi-Structured • Unstructured

  4. Scanning Models

  5. OMR

  6. OMR • OMR is a technology that allows an input device (e.g. an imaging scanner) to read hand-drawn marks, such as filled-in small circles or squares, on specially designed paper. Marks are detected by contrasting reflectivity at predetermined positions on a page.

  7. OMR • OMR converts marks into numbers or letters and enters them into the computer. • There are two known methods of applying OMR technology in data processing, namely • Form based OMR, and • Image based OMR • In form based OMR, one works with a specialized document that contains timing tracks along one edge of the form. These look like black boxes on the top or bottom of the form and indicate to the scanner where to read for marks. • In image based OMR, the scanned image is run through processing or interpretation engines so that a computer can electronically determine the marks on the form. • In effect, form based OMR does the ‘reading’ of data at scan time, whilst image based OMR can create the data during any subsequent process. • Key difference: with form based OMR one cannot add fields for interpretation after scanning, whilst with image based OMR these can be added as and when required. However, with form based OMR, images can be saved during the scanning process, and these would require a KFI process for any further verification or exceptions management.
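The image based reading described on this slide can be sketched roughly as follows. This is only a minimal illustration, assuming the scanned page is already available as a grayscale grid (0 = black, 255 = white) and that the mark positions are known from the form template. The names `zone_darkness`, `read_marks` and `zones`, and the threshold of 128, are hypothetical and do not come from the presentation.

```python
from statistics import mean

def zone_darkness(image, top, left, height, width):
    """Mean pixel value inside a rectangular zone (lower means darker)."""
    pixels = [image[r][c]
              for r in range(top, top + height)
              for c in range(left, left + width)]
    return mean(pixels)

def read_marks(image, zones, threshold=128):
    """Return the labels of zones whose mean brightness is below the threshold."""
    return [label
            for label, (top, left, height, width) in zones.items()
            if zone_darkness(image, top, left, height, width) < threshold]

# Toy 4x8 page: a dark mark fills the left-hand box only.
page = [[255] * 8 for _ in range(4)]
for r in range(1, 3):
    for c in range(1, 3):
        page[r][c] = 10

# Hypothetical template: label -> (top, left, height, width) of each tick box.
zones = {"Male": (1, 1, 2, 2), "Female": (1, 5, 2, 2)}
print(read_marks(page, zones))
```

Because the image is kept, a further entry could be added to `zones` later and the same page re-read, which is the flexibility the slide attributes to image based OMR.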

  8. KFP, KFI, OMR

  9. OMR Advantages and Disadvantages • Advantages • Form based OMR is a data collection technology that does not require a recognition engine. It is therefore fast, uses minimal processing power to process forms, and its costs are predictable and defined. • OMR capture speeds are around 4000 forms per hour, so a large volume can be processed within a short period of time. • Disadvantages • OMR cannot recognize hand-printed or machine-printed characters. • With OMR, images of forms are not captured by scanners so electronic retrieval is not possible. • Tick boxes may not be suitable for all types of questions. • If a user wants to gather large amounts of text, OMR can complicate data collection. • There is also the possibility of missing data in the scanning process: incorrectly numbered or unnumbered pages can lead to them being scanned in the wrong order.
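As a rough planning aid, the capture speed quoted on this slide can be turned into a scanning-time estimate. The sketch below assumes the slide's figure of about 4000 forms per hour per scanner. The function name `scanning_days`, the volume of 2,000,000 forms, the four scanners and the eight-hour day are purely illustrative assumptions.

```python
def scanning_days(total_forms, forms_per_hour=4000, scanners=1, hours_per_day=8):
    """Estimate working days needed to scan a given number of forms."""
    hours = total_forms / (forms_per_hour * scanners)
    return hours / hours_per_day

# Hypothetical workload: 2,000,000 forms on 4 scanners, 8 hours a day.
print(scanning_days(2_000_000, scanners=4))
```

The estimate ignores downtime, rescans and form preparation, so real schedules would need a margin on top of it.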

  10. OMR Best Practices • The entire process must be tested: • Information Capture • Recognizing • Verifying Results • Questionnaire design and preparation is a critical aspect • Forms must be easily scannable and in good condition at scan time, otherwise transcription will be required • Enumerators must take particular care in filling out questionnaires • Completeness and consistency checks must be in place • Care must be taken over the condition of the questionnaires (dust, humidity, transportation, etc.)
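The completeness and consistency checks mentioned above might look something like the following sketch. The field names, the age range and the single consistency rule are hypothetical examples, not rules taken from the presentation.

```python
# Hypothetical list of fields that every captured record must contain.
REQUIRED_FIELDS = ("form_id", "age", "sex")

def check_record(record):
    """Return a list of completeness and consistency problems for one record."""
    problems = []
    # Completeness: every required field must be present and non-empty.
    for field in REQUIRED_FIELDS:
        if record.get(field) in (None, ""):
            problems.append(f"missing {field}")
    # Consistency: illustrative range and cross-field rules on age.
    age = record.get("age")
    if isinstance(age, int):
        if not 0 <= age <= 120:
            problems.append("age out of range")
        if age < 10 and record.get("marital_status") == "married":
            problems.append("age inconsistent with marital status")
    return problems

print(check_record({"form_id": "F1", "age": 34, "sex": "F"}))
print(check_record({"form_id": "F2", "age": 5, "sex": "", "marital_status": "married"}))
```

Returning a list of problems, rather than rejecting the record outright, lets the flagged cases be routed to verification.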

  11. OMR Lessons Learnt • OMR, in any form, can be an extremely powerful tool for data processing of large surveys and censuses; however, it needs to be carefully controlled and managed. • To achieve high accuracy, well structured design and good quality printing of forms are critical. This brings to the fore the issue of costs, as this printing can be extremely costly and geographically limited, since service providers are few and far between. • Although OMR data is relatively accurate, it is important to do detailed testing and constant review of the data being produced to ensure that the right fields are being read. One can do this via various methods, such as an independent comparison of OCR read values versus KFI based values from the same images. • Exceptions can also be easily corrected when images are available on hand for correction.
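One way to picture the independent comparison of machine-read values against KFI values is a per-field tally of disagreements. The sketch below assumes both captures are held as dictionaries keyed by form identifier. The function name `disagreement_by_field` and the sample data are hypothetical.

```python
from collections import Counter

def disagreement_by_field(engine_records, kfi_records):
    """Count, per field, how often two captures of the same form differ."""
    counts = Counter()
    for form_id, engine_fields in engine_records.items():
        kfi_fields = kfi_records.get(form_id, {})
        for field, value in engine_fields.items():
            # Only compare fields that were captured by both methods.
            if field in kfi_fields and kfi_fields[field] != value:
                counts[field] += 1
    return counts

# Hypothetical captures of the same three images.
engine = {"F001": {"sex": "1", "age": "34"},
          "F002": {"sex": "2", "age": "07"},
          "F003": {"sex": "2", "age": "61"}}
kfi = {"F001": {"sex": "1", "age": "34"},
       "F002": {"sex": "1", "age": "07"},
       "F003": {"sex": "1", "age": "61"}}

print(disagreement_by_field(engine, kfi).most_common())
```

A field that dominates the tally, as `sex` does in this toy data, would be a candidate for checking whether the template is reading the right position on the form.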

  12. KFI

  13. KFI • The actual process of KFI is quite similar to that of KFP in that the data capturer still enters data manually; however, instead of capturing from a paper form, he/she captures data directly from an image.

  14. KFI Advantages and Disadvantages • Advantages • Preparatory time • Minimal time is required to implement changes and modifications. • Online verification • A major advantage is that verification of instruments occurs at the time of data entry, so errors and discrepancies can be picked up easily. However, this can be negated if data entry clerks independently change content on the instrument because constant error messages from the system hamper their performance. • Disadvantages • Production time • As in KFP, no computer aided recognition occurs in KFI. The data capturer therefore types each and every character as displayed on the questionnaire. • Keying errors • Keying errors are bound to occur, as each and every character of information is captured manually. As capturers try to reach their targets and increase performance, errors will start to creep in. • Entry clerk changes data due to tight validation • If tight validation is put in place that allows the clerk only a set number of values for entry, any inconsistent information will be changed to the easiest value the clerk can select. In this way invalid and out-of-range data is not consistently edited and corrected, which results in data problems downstream.

  15. Example of multi-type form (OCR, ICR, OMR)

  16. Example of Census Form

  17. OCR/ICR

  18. OCR/ICR • With scanning technology steadily becoming cheaper and more accessible, and with advances in recognition algorithms, OCR and ICR technology have become the foundation of image and forms processing around the world. • OCR technology recognizes machine-printed characters on a form, whilst ICR technology recognizes handwritten characters on a form. The problem of reading machine-printed characters has largely been solved, with accuracy mainly between 99 and 100%. • The key difference between OCR and ICR is that OCR is more accurate than ICR, due to the large amount of variation that occurs in handwriting. Nevertheless, ICR is a great advancement in character recognition, as there is virtually no limit on the types of data that can be collected and converted. This does, however, need to be done with great care and attention to editing and data confrontation to avoid problems.

  19. OCR/ICR: Segmentation of text

  20. OCR/ICR: Segmentation of text • Figure: Engine A + Engine B reading segmented digits (3 1 2 2 4 3 0 8 9 1)

  21. Types of Recognition Engines • Different types of OCR/ICR/OMR engines are used to recognize characters (numeric or alpha-numeric). Clear Image ParaScript KADMOS TISICR EXPERVISION AEG NESTOR LIGATURE A2iA RecoStar JustICR

  22. Majority Voting Rules: Engines • ICR 1 = 3 • ICR 2 = 3 • ICR 3 = 8 • ICR 4 = 3 • Majority = 3 • Unanimous = ?
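The voting rule on this slide can be sketched as below, using the four engine readings shown (3, 3, 8, 3). Returning `None` when the rule is not met is one interpretation of the slide's "?" for the unanimous case. The function name `vote` and the strict-majority definition (more than half of the engines) are assumptions.

```python
from collections import Counter

def vote(readings, unanimous=False):
    """Return the agreed character, or None when the voting rule is not met."""
    value, count = Counter(readings).most_common(1)[0]
    # Unanimous: all engines must agree. Majority: more than half must agree.
    needed = len(readings) if unanimous else len(readings) // 2 + 1
    return value if count >= needed else None

readings = ["3", "3", "8", "3"]  # ICR 1 to ICR 4, as on the slide
print(vote(readings))                  # majority rule
print(vote(readings, unanimous=True))  # unanimous rule
```

Under the unanimous rule the character is left undecided, which in practice would send it to an operator for correction.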

  23. Alpha Recognition - Voting • Engine outputs (ICR A, ICR B, ICR C): *oshua, Jo*hu*, J*sh*a • Voting result: Joshua
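The alpha voting example can be reproduced with a position-by-position vote over the three engine outputs. This sketch assumes that `*` marks a character an engine could not recognize and that all outputs have already been segmented to the same length. The function name `vote_word` is hypothetical.

```python
from collections import Counter

UNKNOWN = "*"  # assumed marker for a character an engine could not read

def vote_word(candidates):
    """Combine equal-length engine outputs character by character."""
    result = []
    for chars in zip(*candidates):
        known = [c for c in chars if c != UNKNOWN]
        if not known:
            # No engine recognized this position; leave it for an operator.
            result.append(UNKNOWN)
            continue
        result.append(Counter(known).most_common(1)[0][0])
    return "".join(result)

print(vote_word(["*oshua", "Jo*hu*", "J*sh*a"]))
```

Each position is settled by the engines that did read it, so three partial readings combine into the full name.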

  24. False Positive Marking

  25. OCR/ICR Advantages and Disadvantages • Advantages • Recognition engines used with imaging can capture highly specialized data sets • Engines can be made to learn regional characteristics and their effects on handwriting • Large saving on resources (human and machine) due to computer assistance in 80% of keying processes • OCR/ICR recognizes machine-printed or hand-printed characters • Scanning and recognition allow efficient management and planning for the rest of the processing workload • Quick retrieval of images for editing and reprocessing • Disadvantages • Technology is costly • May require significant manual intervention if not implemented properly • Additional workload for enumerators: ICR has severe limitations when it comes to human handwriting • Characters must be hand-printed or machine-printed as separate characters in boxes • Ineffective when dealing with cursive characters

  26. OCR/ICR Lessons Learnt • ICR/OCR is a technology that can benefit data processing immensely. However, it must be carefully designed and implemented to avoid problems creeping into the production cycle. • Algorithm development has improved over time and continues to get better; however, if handwriting is poor, more data will be sent for correction, resulting in a greater workload for operators. • Forms design and proper printing are key to the success of the process. • Barcodes can play a vital part in providing a unique identifier for the form, and instruments should be treated as forms before being treated as households.

  27. OCR/ICR QA/Exceptions • One of the major issues with ICR/OCR is that one places trust in the processing engine to provide data that is of excellent quality and a direct reproduction of the instrument. • It is therefore vital to undertake QA processes on any OCR/ICR data to ensure that the conversion process was of adequate quality. This can be done by a sample based recapture of data in an independent system to ascertain a data quality rate or, conversely, the error rate. This can then be utilized as a measure of quality, with the further option of rejection to ensure that only data of an acceptable level is sent through the system. • For exceptions, it has been found that tracking and correcting a small number of cases through a bulk system can prove problematic, and that it is more advantageous to follow a KFP solution for all exceptions. In this way, the bulk production system keeps running and is not hampered by exceptions.
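The sample based QA step might be sketched as follows: draw a sample of forms, re-key them independently, compute the error rate, and accept or reject the batch. The function names, the fixed random seed and the 1% acceptance threshold are illustrative assumptions, not values given in the presentation.

```python
import random

def draw_sample(form_ids, sample_size, seed=0):
    """Pick forms for independent recapture (seeded so the draw is repeatable)."""
    rng = random.Random(seed)
    ids = sorted(form_ids)
    return rng.sample(ids, min(sample_size, len(ids)))

def error_rate(engine_values, rekeyed_values):
    """Share of sampled field values where the engine and the re-key differ."""
    if len(engine_values) != len(rekeyed_values):
        raise ValueError("both captures must cover the same fields")
    if not engine_values:
        return 0.0
    errors = sum(1 for e, k in zip(engine_values, rekeyed_values) if e != k)
    return errors / len(engine_values)

def accept_batch(rate, max_error_rate=0.01):
    """Accept the batch only if the sampled error rate is within the limit."""
    return rate <= max_error_rate

# Hypothetical sampled fields: engine output versus independent re-key.
engine = ["34", "F", "2", "07"]
rekeyed = ["34", "F", "2", "01"]
rate = error_rate(engine, rekeyed)
print(rate, 1 - rate, accept_batch(rate))
```

The quality rate is simply one minus the error rate, so either figure can serve as the acceptance measure.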

  28. Internet Data Collection

  29. Internet Data Collection • The most common methods of data collection for surveys and censuses are personal interviewing and self enumeration. The growing number of respondents with access to the Internet introduces a new data collection alternative that is likely to become increasingly important in the future. • Like computer assisted telephone and personal interviewing, computer assisted self interviewing using the Internet permits an interactive exchange with the respondent through intelligence built into the computer application. • While promising, Internet surveys also face a variety of challenges in survey coverage, in survey design, in security of confidential information, and in mastery of new and rapidly changing technologies

  30. Internet Data Collection • The most important factor in deciding whether internet data collection is a viable alternative is the rate of internet penetration in the respective country. • Some countries have high penetration rates, as in Europe, where some countries boast penetration rates of between 80 and 90 percent. In Africa, recent statistics indicate average internet penetration of around 6.7%; even so, the internet can play an important part as one channel of a multi-channel data collection system in censuses and surveys.

  31. Internet Data Collection • The functional requirements for Internet questionnaires describe an interactive application in which interview questions are presented to the respondent and actions are taken based on the responses. • The Internet consists of heterogeneous client hardware and software. The software, or browser, supports published and de facto standards which allow Web pages to be displayed and executed on the client computer. One needs to be careful to design an interface that is as simple and adaptable as possible, so that it displays correctly in any common browser or web interface.
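The idea of presenting questions and taking actions based on the responses can be illustrated with simple skip logic. The questions, identifiers and routing rules below are hypothetical, and a real Web questionnaire would sit behind a browser interface rather than a dictionary of pre-supplied answers.

```python
# Hypothetical questionnaire: each question names the next one to ask,
# depending on the answer just given (None ends the interview).
QUESTIONS = {
    "q1": {"text": "Did you work last week?",
           "next": lambda answer: "q2" if answer == "yes" else "q3"},
    "q2": {"text": "How many hours did you work?",
           "next": lambda answer: "q3"},
    "q3": {"text": "What is your highest level of education?",
           "next": lambda answer: None},
}

def run_interview(answers, start="q1"):
    """Walk the questionnaire with pre-supplied answers; return the questions asked."""
    path = []
    current = start
    while current is not None:
        path.append(current)
        current = QUESTIONS[current]["next"](answers[current])
    return path

print(run_interview({"q1": "no", "q3": "secondary"}))
print(run_interview({"q1": "yes", "q2": "40", "q3": "secondary"}))
```

A respondent who did not work is never shown the hours question, which is the kind of built-in intelligence a paper form cannot enforce.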

  32. Internet Data Collection • Since the Internet is a public network, security vulnerabilities exist. They include the following: • Eavesdropping, i. e., intermediaries can listen in on private conversations; • Theft, data stolen during the course of transmission or from a computer or network; and • Impersonation, a sender or receiver using a false identity for communication. • The NSO needs to address these issues to provide respondents with a secure and private method to use the Internet for data collection.

  33. Internet Data Collection • Security for Internet data collection had to be addressed at three levels: • (1) the security of communication between the respondent and the NSO; • (2) the security of respondent data at the NSO, and • (3) the security of the NSO network
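For level (1), the communication between the respondent and the NSO, one common measure is to require TLS with certificate verification, which addresses eavesdropping and server impersonation. The sketch below uses Python's standard `ssl` module to build such a client-side context. The function name `make_client_context` and the choice of TLS 1.2 as a minimum are assumptions, and the presentation does not prescribe any particular mechanism.

```python
import ssl

def make_client_context():
    """Build a TLS context that verifies the server certificate and hostname."""
    context = ssl.create_default_context()
    # Assumed policy: refuse protocol versions older than TLS 1.2.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context

context = make_client_context()
print(context.verify_mode == ssl.CERT_REQUIRED, context.check_hostname)
```

Levels (2) and (3), protecting stored respondent data and the NSO network itself, need separate measures that this sketch does not cover.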

  34. Internet Data Collection • Since Web data collection is in its infancy, this is only the beginning. • As Web technology matures, guidelines for Web questionnaire design will be further tested, standardized, and documented. • With these advances and increasing Web skills in the general public, respondents will find Web questionnaires increasingly easy to use. • The ease of use and intuitiveness of a Web questionnaire is important since we do not have the luxury of training the respondent. • The Web also offers the opportunity to use graphics, audio, and video to improve the overall interview experience for the respondent.

  35. Thank you… I reiterate… We still need your valuable inputs to make this document better….
