Sindhi Optical Character Recognition

Sindhi Optical Character Recognition. سنڌي عڪسي اکرن جي سڃاڻپ. By: Mutee U Rahman, Muhammad Rafi, Waleed Butt. A total of 15 main bodies are considered. Due to complications, diacritics are not considered. Tesseract and Decision Tree training models were generated and tested.

Presentation Transcript


  1. Sindhi Optical Character Recognition (سنڌي عڪسي اکرن جي سڃاڻپ) • By: Mutee U Rahman, Muhammad Rafi, Waleed Butt

  2. Summary of the Project • A total of 15 main bodies are considered • Due to complications, diacritics are not considered • Tesseract and Decision Tree training models were generated and tested • Accuracy was calculated by counting the correctly generated IDs
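The accuracy measure named on this slide (counting correctly generated IDs) can be sketched as below. This is a minimal illustration, not the authors' code: the function name `accuracy` and the sample ID lists are hypothetical, and it assumes every test token has exactly one expected main-body ID and one predicted ID.

```python
def accuracy(expected_ids, predicted_ids):
    """Return the fraction of predicted IDs that match the expected IDs."""
    if len(expected_ids) != len(predicted_ids):
        raise ValueError("expected and predicted lists must have equal length")
    if not expected_ids:
        raise ValueError("cannot compute accuracy on an empty test set")
    # Count positions where the recognizer produced the correct ID.
    correct = sum(1 for e, p in zip(expected_ids, predicted_ids) if e == p)
    return correct / len(expected_ids)


# Hypothetical IDs for illustration only; these are not the project's data.
expected = [1, 2, 3, 4, 5]
predicted = [1, 2, 3, 4, 9]
print(f"{accuracy(expected, predicted):.1%}")  # 4 of 5 IDs are correct
```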

  3. Data Description: Data Set-I • 15 main bodies • 35 tokens of training data • 10 tokens of testing data

  4. Data Set-II • 56 random main bodies (MBs)

  5. Tesseract Recognition Results on Data-Set I (Test Data)

  6. Tesseract Accuracy Results on the Data-Set II Data File • 100% accuracy on the random data file

  7. Decision Tree Results

  8. Preprocessing: Line Segmentation • Sample pages with different numbers of lines were given • All lines were extracted correctly (100%)

  9. Preprocessing: Line Segmentation • Pages with different numbers of lines were given for line segmentation • All lines were extracted correctly (100%)
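The slides do not say how lines were extracted. One common approach is a horizontal projection: scan the page row by row and cut wherever a row contains no ink. The sketch below assumes that technique and a binarized page given as a list of rows of 0 (background) and 1 (ink); the function name `segment_lines` and the sample page are hypothetical.

```python
def segment_lines(image):
    """Split a binary page image into text lines.

    `image` is a list of rows; each row is a list of 0 (background)
    or 1 (ink). Returns (top, bottom) row ranges, bottom exclusive.
    """
    lines = []
    start = None
    for y, row in enumerate(image):
        has_ink = any(row)  # projection reduced to "ink or no ink" per row
        if has_ink and start is None:
            start = y  # first inked row of a new text line
        elif not has_ink and start is not None:
            lines.append((start, y))  # a blank row closes the current line
            start = None
    if start is not None:
        lines.append((start, len(image)))  # line touching the bottom edge
    return lines


# Hypothetical 6-row page with two text lines separated by a blank row.
page = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
]
print(segment_lines(page))  # [(1, 3), (4, 5)]
```

On scanned pages, noise and marks that sit between lines usually call for a minimum ink count per row instead of a strict zero test.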

  10. Preprocessing: Syllable/Ligature Segmentation • Syllables/ligatures were successfully extracted from every page • Syllable/ligature segmentation performance: 80%
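The segmentation method is not described on the slide. A frequently used technique for cursive scripts is connected-component labeling, which groups touching ink pixels into blobs. The sketch below assumes that technique, 8-connectivity, and a binarized text-line image; the function name `connected_components` and the sample image are hypothetical.

```python
from collections import deque


def connected_components(image):
    """Return bounding boxes (top, left, bottom, right) of 8-connected ink blobs.

    `image` is a list of rows of 0 (background) or 1 (ink).
    Box bounds are inclusive.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first flood fill from this unvisited ink pixel.
            top, left, bottom, right = y, x, y, x
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


# Hypothetical line image: a larger blob on the left, a single pixel on the right.
line = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 1, 0, 0, 0],
]
print(connected_components(line))  # [(0, 0, 2, 1), (1, 4, 1, 4)]
```

With this approach, dots and other detached marks come out as their own small components, so a later step would have to decide which larger component each one belongs to.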

  11. Preprocessing: Main Body (MB) • We selected 15 MBs from the Sindhi alphabet • We were not able to isolate diacritics; hence the MBs are not correctly identifiable

  12. Preprocessing: Diacritics • We were not able to extract diacritics from the text

  13. Conclusion • On Dataset-I, Tesseract accuracy is 94.4% and Decision Tree (DT) accuracy is 86.7% • On Dataset-II, Tesseract accuracy is 100% • Line extraction: 100%; syllable/ligature segmentation: 80%
