Sindhi Optical Character Recognition
سنڌي عڪسي اکرن جي سڃاڻپ
By: Mutee U Rahman, Muhammad Rafi, Waleed Butt
Summary of the Project
• A total of 15 main bodies are considered
• Diacritics are not considered because of the complications they introduce
• Tesseract and Decision Tree models were trained and tested
• Accuracy is calculated by counting the correctly generated IDs
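The accuracy measure described above, counting correctly generated IDs, can be sketched as follows. This is a minimal illustration, not the project's code: the function name `accuracy` and the sample ID lists are assumptions made for the example.

```python
def accuracy(predicted_ids, expected_ids):
    """Return the percentage of positions where the predicted ID is correct."""
    if len(predicted_ids) != len(expected_ids):
        raise ValueError("predicted and expected lists must be the same length")
    if not expected_ids:
        raise ValueError("cannot compute accuracy on an empty test set")
    # Count the tokens whose generated ID matches the expected ID.
    correct = sum(1 for p, e in zip(predicted_ids, expected_ids) if p == e)
    return 100.0 * correct / len(expected_ids)


# Hypothetical IDs for illustration only; not data from the project.
expected = [1, 2, 3, 4, 5]
predicted = [1, 2, 3, 4, 9]
print(accuracy(predicted, expected))  # 80.0
```

The same function would apply to either recognizer, since it only compares the generated IDs with the expected ones.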
Data Description: Data Set-I
• 15 main bodies
• 35 tokens of training data
• 10 tokens of testing data
Data Set-II
• 56 random main bodies (MBs)
Tesseract Accuracy Results on Data Set-II
• 100% accuracy on the random data file
Preprocessing: Line Segmentation
• Sample pages with different numbers of lines were given for line segmentation
• All lines were extracted correctly (100%)
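The slides do not say how lines were extracted. One common technique is a horizontal projection profile: sum the ink in each pixel row and split the page wherever a row is empty. The sketch below assumes a binarized page given as a list of rows of 0 (background) and 1 (ink); the function name `segment_lines` and the tiny sample page are hypothetical.

```python
def segment_lines(image):
    """Return (top, bottom) row ranges of text lines; bottom is exclusive.

    `image` is a list of rows, each a list of 0 (background) or 1 (ink).
    """
    # Horizontal projection profile: amount of ink in each row.
    profile = [sum(row) for row in image]
    lines = []
    start = None
    for y, ink in enumerate(profile):
        if ink > 0 and start is None:
            start = y                      # first inked row of a new line
        elif ink == 0 and start is not None:
            lines.append((start, y))       # blank row closes the current line
            start = None
    if start is not None:
        lines.append((start, len(profile)))  # line touching the bottom edge
    return lines


# Hypothetical 5x4 page with two text lines separated by one blank row.
page = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
]
print(segment_lines(page))  # [(1, 3), (4, 5)]
```

This approach relies on at least one fully blank row between lines, which clean printed pages usually provide.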
Preprocessing: Syllable/Ligature Segmentation
• Syllables/ligatures were extracted from every page
• Syllable/ligature segmentation accuracy is 80%
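The slides do not describe the segmentation method. As a rough sketch of one simple option, a vertical projection profile splits a single text line at empty pixel columns. The example assumes a binarized line image (rows of 0 and 1); the function name `segment_ligatures` and the sample data are hypothetical. Segments are returned right to left, following the reading direction of Sindhi script.

```python
def segment_ligatures(line_image):
    """Return (left, right) column ranges of ink runs; right is exclusive.

    Ranges are ordered right to left, matching the reading direction.
    """
    if not line_image:
        return []
    width = len(line_image[0])
    # Vertical projection profile: amount of ink in each column.
    profile = [sum(row[x] for row in line_image) for x in range(width)]
    segments = []
    start = None
    for x, ink in enumerate(profile):
        if ink > 0 and start is None:
            start = x                       # first inked column of a segment
        elif ink == 0 and start is not None:
            segments.append((start, x))     # blank column closes the segment
            start = None
    if start is not None:
        segments.append((start, width))     # segment touching the right edge
    segments.reverse()                      # reading order is right to left
    return segments


# Hypothetical 2x7 line image containing three separate ink runs.
line = [
    [1, 1, 0, 0, 1, 0, 1],
    [0, 1, 0, 0, 1, 0, 0],
]
print(segment_ligatures(line))  # [(6, 7), (4, 5), (0, 2)]
```

A column-gap split of this kind cannot separate ligatures that overlap horizontally, so it is only a starting point.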
Preprocessing: Main Body (MB)
• We selected 15 MBs from the Sindhi alphabet
• We were not able to isolate diacritics, so the MBs are not correctly identifiable
Preprocessing: Diacritics
• We were not able to extract diacritics from the text
Conclusion
• On Data Set-I, Tesseract accuracy is 94.4% and Decision Tree (DT) accuracy is 86.7%
• On Data Set-II, Tesseract accuracy is 100%
• Line extraction accuracy is 100%; syllable/ligature segmentation accuracy is 80%