In the name of Allah, the Most Gracious, the Most Merciful
Presenting Results and Training Data of the Expanded Evaluation Experiment of HMM-Based Arabic Omni Font-Written OCR
Mohamed Attia & Mohamed El-Mahallawy
RDI's Meeting Room; Oct. 2007
Overall Results
The omni quality of an OCR system is measured by its capabilities at:
• Assimilation: how well it recognizes pages (whose text content is not included in the training data) printed in fonts that are represented in the training data. Ultimate predefined goal: WERA around 3%.
• Generalization: how well it recognizes pages printed in fonts that are not represented in the training data. Ultimate predefined goal: WERG around 3·WERA.
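For reference, assuming the standard edit-distance-based definition of word error rate (a standard formula, not quoted from the slides):

WER = (S + D + I) / N

where S, D, and I are the numbers of substituted, deleted, and inserted words, and N is the number of words in the reference text. WERA denotes the WER measured on the assimilation test set, and WERG the WER measured on the generalization test set.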
Error Analysis of Assimilation Test Regarding Font Shape/Size
[Table: WERA broken down by font shape and size]
Error Analysis of Assimilation Test Regarding the Most Frequent Recognition Mistakes
These are the most frequent 17 mistakes, which together contribute about 63.15% of WERA.
Error Analysis of Generalization Test Regarding Font Shape/Size
[Table: WERG broken down by font shape and size]
Sample pages from 2 books of typical print quality have also been tried. The WERG on book #1's sample pages (1,700 words) is 11.70%, and on book #2's sample pages (1,100 words) it is 7.25%.
Error Analysis of Generalization Test Regarding the Most Frequent Recognition Mistakes
These are the most frequent 19 mistakes, which together contribute about 55.90% of WERG.
Training and Evaluation Data
• 9 distinct fonts, each over its significant range of writing sizes, are used for training/building the recognition models: 7 of them are MS-Windows fonts and 2 are Mac OS fonts.
• At each size of each font, 25 different pages are used for training and another 5 different pages are used for the assimilation test. This sums up to (9·6·25 = 1,350) pages ≈ 1,350·200 = 270,000 words ≈ 270,000·4 = 1,080,000 graphemes for training, and (9·6·5 = 270) pages ≈ 54,000 words ≈ 216,000 graphemes for assimilation testing.
• 3 further Mac OS fonts, over their full size range, are used for the generalization test.
• At each size of each of these 3 fonts, 5 pages are used for the generalization test. This sums up to (5·6·3 = 90) pages ≈ 18,000 words ≈ 72,000 graphemes for generalization testing.
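A minimal sanity-check sketch of the quoted data-set sizes (not part of the original experiment; the words-per-page factor of 200, the graphemes-per-word factor of 4, and the 6 sizes per font are read off the slide's own products):

```python
# Sanity check of the training/evaluation data sizes quoted above.
train_fonts, gen_fonts = 9, 3          # fonts represented / not represented in training
sizes_per_font = 6                     # implied by the slide's 9*6*25 product
train_pages_per_cell, test_pages_per_cell = 25, 5
words_per_page, graphemes_per_word = 200, 4

def totals(fonts, pages_per_cell):
    pages = fonts * sizes_per_font * pages_per_cell
    words = pages * words_per_page
    graphemes = words * graphemes_per_word
    return pages, words, graphemes

print("training      :", totals(train_fonts, train_pages_per_cell))  # (1350, 270000, 1080000)
print("assimilation  :", totals(train_fonts, test_pages_per_cell))   # (270, 54000, 216000)
print("generalization:", totals(gen_fonts, test_pages_per_cell))     # (90, 18000, 72000)
```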
Effect of Language Model
• Our language model is constrained neither by a certain lexicon nor by a set of linguistic rules; i.e., it is an open-vocabulary language model.
• Our statistical language model (SLM) is an m-gram model built using the Bayes_Good-Turing_Back-Off methodology.
• The unit of our SLM is the grapheme, i.e. the ligature.
• The order of the SLM deployed in our system is 2 (a bigram model).
• Our SLM is built from the NEMLAR raw text corpus, with a size of 550,000 words (≈ 2,200,000 graphemes) distributed over the 13 most significant domains of modern and heritage standard Arabic.
• Deploying versus neutralizing the SLM has a measurable effect on the realized WER of our system.
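For reference, a standard Katz-style back-off bigram over graphemes with Good-Turing discounting is one common realization of the methodology named above (the exact formulation used in the system may differ):

P(g_i | g_{i-1}) = d(g_{i-1}, g_i) · C(g_{i-1}, g_i) / C(g_{i-1})   if C(g_{i-1}, g_i) > 0
P(g_i | g_{i-1}) = α(g_{i-1}) · P(g_i)                              otherwise

where C(·) denotes counts over the training corpus, d(·) is the Good-Turing discount factor, and α(g_{i-1}) is the back-off weight chosen so that the probabilities sum to one.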
Appreciating How Distinct the Fonts Used for Training, Assimilation Testing, and Generalization Testing Are
Can Our OCR System Statistically Build Concepts of Font Shapes? A Case Study
• Some fonts that are conceptually distinct from those comprising the training data are very challenging for generalization testing; i.e., WERG >> WERA.
• In our first trial of a generalization test, the recognition models were built from the 7 MS-Windows fonts and the testing data was composed of the 3 Mac OS fonts. Under these conditions we got the poor result of WERG ≈ 35% ≈ 11·WERA (WERG >> WERA).
• After error analysis and some contemplation, we realized that the Mac OS fonts are built with concepts not covered by the 7 MS-Windows fonts; e.g. connected dots, overlapping of the tails of some graphemes, etc.
• After adding 2 Mac OS fonts to introduce those concepts into the training data, we achieved the dramatic enhancement of WERG = 10.32% ≈ 3.4·WERA.
Our OCR system can indeed statistically build font shape concepts.
Current Parameter Settings and Computational Capacity
Computational capacity of the current pilot system:
• Runtime (recognition phase): somewhat slow but bearable.
• Offline (training phase): very slow! As per the experiment reported here:
  - Building the codebook takes about 45 hours (see the sketch below).
  - Building the HMMs takes about 53 hours.
• As our pilot system is built from a hybrid of off-the-shelf tools (some of them voluntarily developed), a professionally optimized software implementation of the system may save up to 50% of the training/recognition time.
• Another 25% may be saved by using more powerful contemporary hardware.
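For illustration only, a minimal sketch of what the codebook step involves, assuming a standard vector-quantization approach; the codebook size, feature dimension, and the use of scikit-learn's KMeans are hypothetical choices, not details taken from the slides:

```python
# Illustrative sketch: build a VQ codebook from grapheme feature vectors and
# quantize them into discrete symbols, the kind of input a discrete HMM needs.
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(features: np.ndarray, codebook_size: int = 128) -> KMeans:
    # The cluster centroids act as the codebook entries.
    return KMeans(n_clusters=codebook_size, n_init=10, random_state=0).fit(features)

def quantize(codebook: KMeans, features: np.ndarray) -> np.ndarray:
    # Map each feature vector to the index of its nearest codebook entry.
    return codebook.predict(features)

# Random data standing in for real grapheme feature vectors (hypothetical sizes).
features = np.random.rand(10_000, 20)
codebook = build_codebook(features)
symbols = quantize(codebook, features)   # discrete observation sequence for HMM training
```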
Conclusion
• Is our OCR system truly omni? Yes.
  It can both assimilate and generalize at a remarkable WER under tough training and testing data sets. In fact, the obtained WERs are the best reported in the published literature in this regard.
• Is there room for further enhancement? Yes, regarding both:
  - Reducing WERG by building recognition models from more distinct fonts (esp. Mac OS ones), sizes, and writing styles.
  - Reducing the training/recognition time by a professionally optimized re-build of the core system, as well as by using more powerful hardware.
Thank you for your kind attention.
To probe further, contact:
m_Atteya@RDI-eg.com
Mahallaway@AAST.edu